A multistage interconnect network (14) capable of supporting massive parallel processing,
including point-to-point and multicast communications between processor modules (12)
which are connected to the input and output ports (20) of the network, includes
interconnected switch nodes (16) arranged in $2\lceil \log_b N \rceil$ stages, wherein b is the
number of switch node input/output ports (36,38), N is the number of network input/output
ports (20), and $\lceil \log_b N \rceil$ indicates a ceiling function providing the smallest
integer not less than $\log_b N$. The additional stages provide additional paths between
network input ports (20) and network output ports (20), thereby enhancing fault tolerance
and lessening contention.
This invention relates to multi-stage interconnect networks suitable for multiprocessor computer systems.

Parallel processing is considered an advantageous approach for increasing processing speeds in computer 
systems. Parallel processing can provide powerful communications and computer systems which can handle 
complex problems and manipulate large databases quickly and reliably. 
One example of parallel processing can be found in U.S. Patent No. 4,412,285. This patent describes a
system using a sorting network to intercouple multiple processors so as to distribute priority messages to all
processors.

Communication within parallel processing systems is typically classified as either tightly coupled wherein 
communication occurs through a common memory or loosely coupled wherein communication occurs via 
switching logic and communications paths. Various topologies and protocols for loosely coupled processors
have been proposed and used in the prior art. 

Examples of known interconnection topologies include single stage networks and multistage interconnect 
networks (MINs). 

A single stage network has one stage of switching elements such that information can be re-circulated until 
it reaches the desired output port. A MIN has a plurality of switching element stages capable of connecting any
input port of the network to any output port. 

In general, MINs consist of several stages of switching elements or switch nodes that are wired together
according to a regular pattern. Typically, each switch node is a small crossbar switch that usually has an equal
number of inputs and outputs, e.g., a b x b switch node. Prior art MINs typically consist of $\log_b N$ stages, wherein
b is the number of input/output ports of a switch node, and N is the number of input/output ports of a network.
Typically, such MINs are constructed from the smallest number of links and switch nodes that allows any
network input port to be connected to any network output port.

Prior attempts at implementing MINs suffer from several disadvantages. One disadvantage arises because 
each network input/output port pair typically has only one way to be connected, thereby making the MIN
susceptible to internal contention. Internal contention occurs when two paths require the same link even though
the paths may or may not be to the same network output port.

Another disadvantage is lessened reliability due to the number and complexity of components. If a fault 
occurs, it is often difficult to determine where the problem lies. Further, it may be impossible to reconfigure the 
system to exclude the failed component or service the system without shutting it down, thereby leaving the
system inoperable until the problem is corrected.

Another disadvantage is complex, expensive, and time-consuming manufacturing and installation require- 
ments. For large network configurations, cabling may be unmanageable due to the logistics of making sure ev- 
ery component is correctly cabled and plugged into the correct connector. 

Still another disadvantage involves diminishing bandwidth. The bandwidth available to each processor tends
to decrease as the system size grows.

According to one aspect of the present invention, there is provided a communications system, including a 
multistage interconnect network comprising a plurality of interconnected active logic switch nodes, character- 
ized by diagnostic means for detecting and reporting any errors that occur within the network, and for isolating 
the errors without propagating them; and reconfiguration means for reconfiguring the network when an error 
is detected, without interrupting communications in the system.

Other aspects of the present invention are specified in the independent claims included in the set of claims 
appended to the present description. 

An embodiment of the present invention will now be described by way of example, with reference to the 
accompanying drawings, in which:- 
Figure 1 illustrates the components of the present invention, which comprises a general purpose
multiprocessor computer system capable of massive parallel processing;

Figure 2 provides one example illustrating how the switch nodes are interconnected to implement a net- 
work; 

Figure 3 illustrates the permutation function between stage 0 and stage 1 for a network having between 9
and 64 network I/O ports;

Figure 4 describes the components of an 8 x 8 switch node according to the present invention; 
Figure 5 is a block diagram describing the components of the controllers that connect each PM to the net- 
works; 

Figure 6 describes a Type-A board used in the network; 
Figure 7 describes a Type-B board used in the network;
Figure 8 describes a Type-C board used in the network; 

Figure 9 illustrates a network comprising a single Communication Module Assembly (CMA/A), which sup- 
ports between 2 and 64 network I/O ports; 
Figure 10 describes circuit switching within the CMA/A wherein a Universal Wiring Pattern (UWP) between
stage 0 and stage 1 switch nodes is embedded in a backplane; 

Figure 11 illustrates a network 14 having CMA/As and CMA/Bs, which support between 65 and 512 network
I/O ports; 

Figure 12 illustrates a network 14 having CMA/As and CMA/Cs, which support between 65 and 4096
network I/O ports;

Figures 13 (a) and (b) illustrate a cable harness assembly; 

Figure 14 illustrates a practical implementation of the cable harness assembly shown in Figures 13 (a) and 
(b); 

Figure 15 shows a simplified wiring diagram describing how the switch nodes are connected in a network
having 128 network I/O ports;

Figures 16 (a), (b), (c) and (d) provide simplified wiring diagrams describing the expansion from 64 PMs 
12 to 65-128 PMs; 

Figure 17 shows the cabling for the situation in which there are 512 network I/O ports in the network; 
Figure 18 shows the cabling for the situation in which there are more than 512 network I/O ports in the
network;

Figure 19 shows the cabling for the situation in which there are 1024 network I/O ports in the network; 
Figure 20 shows the largest possible configuration of 4096 network I/O ports using eight cabinets to house 
the network; 

Figure 21 is a flow chart describing the steps required for configuring the network;

Figure 22 is a flow chart describing the steps required for reconfiguring the network when a fault
occurs;

Figure 23 illustrates the paths traversed through the network by a monocast connect command; 
Figure 24 illustrates the software tasks executed by the network controllers; 
Figure 25 illustrates the paths traversed through the network by a multicast connect command;

Figure 26 illustrates one possible application of the present invention, which comprises a general purpose 
multiprocessor computer system capable of massive parallel processing. 

1. GENERAL DESCRIPTION

Figure 1 illustrates the components of the present invention, which comprises a general purpose
multiprocessor computer system 10 capable of massively parallel processing. The components illustrated in Figure 1
include processor modules (PMs) 12, networks 14, switch nodes 16, controllers 18, network I/O ports 20, optical
transceivers 22, optical fibers 24, Transparent Asynchronous Transceiver Interface (TAXI) transceivers 26,
redundant master clocks 28, bounce-back points 30, forward channels 32, and back channels 34.

The PMs 12 are common platform processor modules which communicate with each other by means of 
redundant networks 14. However, it is envisioned that the network 14 of the present invention could be used 
for communications purposes in a large number of different applications. Thus, those skilled in the art will
recognize that any number of agents of various types, e.g., memory devices, peripheral devices, etc., could be
substituted for the PMs 12 shown.

The system 10 may use redundant networks 14 (labeled network A and network B in Figure 1) for enhanced 
fault tolerance and increased bandwidth. If one of the networks 14 is not available, then another network 14 
can take over, to allow for graceful degradation of the system 10 in the presence of malfunctions. Software 
executed by the PMs 12 handles the added complexity of redundant networks 14 and automatically balances
the load between operative networks 14. The software also supports fault detection and switching in the event of a failure
of one of the networks 14.

Each network 14 is a multistage interconnect network 14 (MIN) that employs active logic switch nodes 16. 
In the preferred embodiment, the switch nodes 16 have eight input ports which can be connected to any of eight 
output ports to effectuate the switching functions. (In the following description, the term "switch node 16 I/O
port" is often used to refer to a pair of corresponding, i.e., similarly numbered, input and output ports of a switch
node 16.) A plurality of switch nodes 16 are interconnected together in a plurality of stages to provide the paths
between the network input ports and the network output ports. (In the following description, the term "network
I/O port 20" is often used to refer to a pair of corresponding, i.e., similarly numbered, input and output ports of
a network 14. Typically, a network I/O port 20 will interface to one PM 12, although this is not required to practice
the present invention.)

In the preferred embodiment, there are more than $\lceil \log_b N \rceil$ stages in the network 14, wherein b is the
number of I/O ports of a switch node 16, N is the number of network I/O ports 20, and $\lceil \log_b N \rceil$ indicates a ceiling
function providing the smallest integer not less than $\log_b N$. (Typically, a switch node 16 will have the same
number of input ports and output ports, although this is not required to practice the present invention. If the
number of input ports and output ports is not identical, then the above equation would become $\log_{(a,b)} N$, wherein
a is the number of switch node 16 input ports and b is the number of switch node 16 output ports.) The additional
stages provide additional communication paths between any network input port and network output port,
thereby enhancing fault tolerance and lessening contention.

Each network 14 is logically full-duplex. The bandwidth of the network 14 is not limited by the bandwidth 
of any particular switch node. In fact, the bandwidth of the network 14 increases as the number of network I/O 
ports 20 increases due to the increased number of paths between switch nodes 16. Functionally, the network 
14 provides a plurality of possible interconnection paths for a circuit, from a sending PM 12 to a set (one or
more) of receiving PMs 12.

Each network 14 automatically detects and reports any errors that occur during operation, even if there
is no traffic. The network 14 is able to detect and isolate errors automatically without propagating them, which
improves serviceability. The network 14 can be automatically reconfigured when a fault is detected, without
interrupting the operation of the system 10 and with minimal performance degradation after reconfiguration.

Communications between the PMs 12 are conducted in two basic modes: point-to-point and multicast. In
point-to-point communications, a PM 12 transmits a connect command to another PM 12. The connect
command travels through a forward channel 32 in the network 14 to the receiving PM 12. The receiving PM 12
returns a reply to the sending PM 12 through a back channel 34. Once the connection is made to the receiving
PM 12, the sending PM 12 transmits its messages, and then terminates the connection when the transmission
is done. The network 14 will support many such point-to-point communications between different pairs of
PMs 12 at the same time. In the absence of conflicts, all PMs 12 could communicate at the same time.

In the second, or multicast, mode of communications, a single PM 12 can broadcast a message to all of
the other PMs 12 or a predefined group of PMs 12. The predefined groups of PMs 12 are called "superclusters"
and multicast commands within different superclusters can occur simultaneously. The sending PM 12 transmits
its multicast command, which propagates through the forward channel 32 to all of the PMs 12 or the group of
PMs 12. The PMs 12 that receive multicast messages reply to them by transmitting, for example, their current
status through the back channel 34. The network 14 can function to combine the replies in various ways.

Each PM 12 has at least one separate controller 18 for interfacing to each network 14. There is no limit on
the number of controllers 18 that connect a PM 12 to a network 14 if additional bandwidth is desired. Transparent
Asynchronous Transceiver Interface (TAXI) transceivers 26 are used to serialize and de-serialize data for
transmission between the controllers 18 and the network 14 over optical fiber 24. The TAXI transceivers 26 convert
parallel data into a high speed serial form that encodes clock information into the data stream, and vice versa.
The controller 18 outputs a forward channel 32 consisting of eight bits of data plus a single bit of parity, and a
one bit back channel 34 associated with the receive channel to the TAXI transceiver 26. The controller 18
receives a forward channel 32 consisting of eight bits of data plus a single bit of parity and a one bit back channel
34 associated with the transmit channel from the TAXI transceiver 26. The TAXI transceiver 26 converts the
10 bits of parallel data into bit serial data. The TAXI transceiver 26 converts the bit serial data back into 10 bits
of parallel data and recovers the clock. The back channels 34 are only one bit so they can interface to the TAXI
transceivers 26 with the forward channels 32, thus providing more efficient packaging.


2. NETWORK TOPOLOGY 

Figure 2 provides one example illustrating how the switch nodes 16 are interconnected to implement a
network 14. In the preferred embodiment, the 8 x 8 switch nodes 16 are arranged in $2\lceil \log_8 N \rceil$ stages, wherein
N is the number of network I/O ports 20 and $\lceil \log_8 N \rceil$ indicates a ceiling function providing the smallest integer
not less than $\log_8 N$. Thus, for a network 14 having 8 or fewer network I/O ports 20, there are $2\lceil \log_8 8 \rceil = 2$ stages;
for a network 14 having between 9 and 64 network I/O ports 20, there are $2\lceil \log_8 64 \rceil = 4$ stages; for a network
14 having between 65 and 512 network I/O ports 20, there are $2\lceil \log_8 512 \rceil = 6$ stages; and for a network 14
having between 513 and 4096 network I/O ports 20, there are $2\lceil \log_8 4096 \rceil = 8$ stages. The additional stages
provide additional communication paths between any network input port and network output port, thereby
enhancing fault tolerance and lessening contention.
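
The stage counts listed above can be reproduced with a short calculation. The following is only an illustrative sketch: the function names `ceil_log` and `stage_count` are hypothetical and do not appear in the specification, and the clamp to a minimum of one tier for a single-port network is an assumption made so that the "8 or fewer ports" case always yields 2 stages.

```python
def ceil_log(base: int, n: int) -> int:
    """Smallest integer k with base**k >= n, using integer arithmetic only."""
    if base < 2 or n < 1:
        raise ValueError("base must be >= 2 and n must be >= 1")
    k, reach = 0, 1
    while reach < n:
        reach *= base
        k += 1
    return k


def stage_count(n_ports: int, b: int = 8) -> int:
    """Total stages in the folded network: 2 * ceil(log_b N).

    Assumption: a degenerate one-port network is treated like any other
    network of 8 or fewer ports (one tier on each side of the fold).
    """
    return 2 * max(1, ceil_log(b, n_ports))


if __name__ == "__main__":
    for n in (8, 9, 64, 65, 512, 513, 4096):
        print(n, stage_count(n))
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers of 8 (64, 512, 4096) cannot be rounded into the next tier.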

As indicated in Figure 2, the stage numbers increment from left to right beginning at 0, until a
"bounce-back point" 30 is reached, at which point the stage numbers decrement from left to right back to 0. The
bounce-back point 30 indicates the point where the stages of the network 14 are physically folded. Folding the network
14 allows corresponding switch nodes 16 in similarly numbered stages on either side of the bounce-back point
30 to be located adjacent to each other to simplify packaging and to minimize signal path lengths (especially
to/from the PMs 12). The folded network 14 is illustrated by Figure 1, and Figures 6, 7, and 8 described further
hereinafter in conjunction with Type-A, -B, and -C boards.

EP0 511 834 A2 

Each 8x8 switch node 16 used in the preferred embodiment has eight input ports and eight output ports, 
wherein each port interfaces to a 9-bit (8-bits of data and 1 bit of parity) forward channel 32 and a 1-bit back 
channel 34. (For the sake of brevity and clarity, however, Figure 2 represents each forward channel 32 and 
back channel 34 pair with a single line, wherein the direction of the forward channel 32 is indicated by an arrow
and the direction of the back channel 34 is opposite the arrow).

Within any 8x8 switch node 16, any input port can be connected to any output port by the function of the
logic within the switch node 16. Up to eight PMs 12 may be applied to the eight input ports of each switch node
16 in the "left" stage 0 switch nodes 16 on the left side of the bounce-back point 30 in Figure 2; these are the
network input ports. Each of the output ports from the "left" stage 0 switch nodes 16 communicates bidirectionally
with a different one of the "left" stage 1 switch nodes 16 on the left side of the bounce-back point 30 in Figure
2, so that any one of the "left" stage 0 switch nodes 16 can communicate with any one of the "left" stage 1
switch nodes 16. (For the sake of brevity and clarity, however, Figure 2 shows only a portion of the
interconnections between switch nodes 16). Each of the output ports from the "left" stage 1 switch nodes 16
communicates bidirectionally with a corresponding "right" stage 1 switch node 16 on the right side of the bounce-back
point 30 in Figure 2. Each of the output ports from the "right" stage 1 switch nodes 16 communicates
bidirectionally with a different one of the "right" stage 0 switch nodes 16 on the right side of the bounce-back point 30
in Figure 2, so that any one of the "right" stage 1 switch nodes 16 can communicate with any one of the "right"
stage 0 switch nodes 16; these are the network output ports. Thus, any PM 12 connected to a "left" stage 0
switch node 16 can communicate with any PM 12 connected to a "right" stage 0 switch node 16 by appropriate
switching of the stage 0 and stage 1 switch nodes 16.

The pattern of interconnections between the stage 0 and stage 1 switch nodes 16 in Figure 2 is termed a
Universal Wiring Pattern (UWP). This pattern is "universal" because the interconnections between different
stages in any size network 14 consist of one or more copies of the UWP. (Note that the pattern of interconnections
between similarly numbered stages, i.e., across the bounce-back point 30, is not a UWP, but instead consists
of a "straight" interconnection wherein the output ports of a switch node 16 communicate bidirectionally only
with the input ports of a corresponding switch node.)

For a network 14 of size $N = 8^n$, $n > 1$, wherein n indicates the number of stages in the network and N
indicates the number of network I/O ports 20 and thus the number of PMs 12 that can be attached thereto, the
number of copies of the UWP between each stage is $8^{n-2}$.

For 8 or fewer network I/O ports 20 (n=1), there is only one stage and thus no UWP.

For 9 to 64 network I/O ports 20 (n=2), there is one ($8^{2-2}$) copy of the UWP between each pair of stages.
For 65 to 512 network I/O ports 20 (n=3), there are eight ($8^{3-2}$) copies of the UWP between each pair of
stages. In the preferred embodiment, the patterns do not overlap between Stages 0 and 1; the patterns are
stretched out and overlap between Stages 1 and 2.

For 513 to 4096 network I/O ports 20 (n=4), there are 64 ($8^{4-2}$) copies of the UWP between each pair of
stages. In the preferred embodiment, the patterns do not overlap between Stages 0 and 1; the patterns are
stretched out and overlap between Stages 1 and 2; the patterns do not overlap between Stages 2 and 3.

The UWP is a function of the switch node 16 size and is generated by a permutation function that identifies
which ports to connect between switch nodes 16 in different stages. Mathematical properties of these
interconnections simplify cabling in the network 14.

Because 8x8 switch nodes 16 are used, the number of network I/O ports 20 is $N = 8^n$, $n \in \{1, 2, 3, \ldots\}$,
and there are n Stages numbered from 0 to $n - 1$. The switch nodes 16 in each Stage are numbered from top
to bottom from 0 to $N/8 - 1$. The input/output ports of the switch nodes 16 in each Stage can be numbered from
top to bottom from 0 to $N - 1$, which are the ports' Levels. The ports on each side of a given switch node 16
are numbered from 0 to 7 from top to bottom.

There are two ways to reference a specific input/output port on a specific switch node 16. The first method
is by (Stage : Level) and the second is by the triplet (Stage : Switch-Node-Number : Switch-Node-Port-Number).
For example, in a network 14 of N = 512 network I/O ports 20 (n=3), let S be the Stage number and
X be the Level number, wherein X is an arbitrary number, $0 \le X < N$, represented using octal digits as: $x_{n-1} \ldots x_1 x_0$,
where $0 \le x_i < 8$ and $0 \le i < n$. Therefore, $(S : x_2 x_1 x_0)$ is the reference by the first method and $(S : x_2 x_1 : x_0)$ is
the reference by the second method.
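
The relationship between the two reference methods amounts to splitting the Level number at its least significant octal digit. A minimal sketch follows, assuming 8 x 8 switch nodes; the helper names `level_to_triplet`, `triplet_to_level`, and `octal_digits` are hypothetical and are used only for illustration.

```python
def level_to_triplet(stage: int, level: int, b: int = 8) -> tuple:
    """(Stage : Level) -> (Stage : Switch-Node-Number : Switch-Node-Port-Number)."""
    node, port = divmod(level, b)  # port is the least significant base-b digit
    return (stage, node, port)


def triplet_to_level(stage: int, node: int, port: int, b: int = 8) -> tuple:
    """(Stage : Switch-Node-Number : Switch-Node-Port-Number) -> (Stage : Level)."""
    return (stage, node * b + port)


def octal_digits(level: int, n: int, b: int = 8) -> list:
    """Digits [x_{n-1}, ..., x_1, x_0] of a Level number, most significant first."""
    digits = []
    for _ in range(n):
        level, d = divmod(level, b)
        digits.append(d)
    return digits[::-1]


if __name__ == "__main__":
    # Level 25 (decimal) on Stage 0 is port 1 of switch node 3.
    print(level_to_triplet(0, 25))
    print(octal_digits(25, 2))
```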

It can be shown that the pattern of connections between each Stage is completely specified by permuting
the digits of the Level number. In the general case, for all X, $0 \le X < N$, the total set of switch node 16 output
ports numbered $(S : x_{n-1} \ldots x_1 x_0)$ are connected to the switch node 16 input ports
$(S+1 : \mathrm{PERMUTE}^n_S\{x_{n-1} \ldots x_1 x_0\})$. The permutation function is subscripted with an "S" to indicate that the function is associated with a
specific Stage, and typically, is different in each Stage. The "n" superscript refers to the number of Stages in
the network 14.

For a network 14 of 8 or fewer network I/O ports 20 (n=1) there is no permutation function, because only
two Stage 0 switch nodes 16 are used.

For a network 14 of between 9 and 64 network I/O ports 20 (n=2) there is only one possible permutation
function between Stage 0 and Stage 1: $\mathrm{PERMUTE}^2_0\{x_1 x_0\} = x_0 x_1$. To see how this works, examine Figure 3.
The Level numbers are shown at the ports on the extreme left and right sides of Figure 3. Consider the second
output from switch node 16 #3 in Stage 0, i.e., (0:3:1). It is at Level $25_{10}$, which is $31_8$. To calculate which input
it should be connected to in Stage 1, reverse the octal digits to obtain $13_8$, which is Level $11_{10}$. This process
can be repeated for each Level from 0 to 63 to obtain a table enumerating the connections.
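
The table of connections for the n=2 case can be generated by swapping the two octal digits of every Level. The sketch below is illustrative only; the function name `permute_2_0` and the dictionary `uwp_table` are hypothetical names chosen for this example.

```python
def permute_2_0(level: int) -> int:
    """Swap the two octal digits of a Level in the range 0..63 (x1 x0 -> x0 x1)."""
    if not 0 <= level < 64:
        raise ValueError("level must be in 0..63")
    x1, x0 = divmod(level, 8)
    return x0 * 8 + x1


# Stage 0 output Level -> Stage 1 input Level, for all 64 Levels.
uwp_table = {level: permute_2_0(level) for level in range(64)}

if __name__ == "__main__":
    # Output (0:3:1) sits at Level 25 and connects to Level 11 in Stage 1.
    print(uwp_table[25])
    for src in range(0, 64, 8):
        print([uwp_table[src + port] for port in range(8)])
```

Because the digit swap is its own inverse, the same table also gives the wiring in the reverse direction.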

For a network 14 of between 65 and 512 network I/O ports 20 (n=3), two permutation functions are needed:
$\mathrm{PERMUTE}^3_0\{x_2 x_1 x_0\} = x_2 x_0 x_1$ and $\mathrm{PERMUTE}^3_1\{x_2 x_1 x_0\} = x_1 x_0 x_2$. To see the effect of this sequence of
permutation functions, examine its effect on the octal number $210_8$. This number is chosen to illustrate where the digits
are mapped at each Stage in the network 14. 210 is mapped by $\mathrm{PERMUTE}^3_0$ to 201, and that is then mapped
by $\mathrm{PERMUTE}^3_1$ to 012. The permutation function is chosen so that each digit number (e.g., 0, 1, and 2) appears
in the least significant position once. Clearly, these permutation functions meet the condition (notice the
least significant digit at each step). This condition guarantees that every network I/O port 20 will have a path to every other network
I/O port 20. Another $\mathrm{PERMUTE}^3_1$ function that could be used with the given $\mathrm{PERMUTE}^3_0$ function is
$\mathrm{PERMUTE}^3_1\{x_2 x_1 x_0\} = x_0 x_1 x_2$. This would produce the mappings 210 to 201 to 102, which meets the constraint. If
either $\mathrm{PERMUTE}^3_1$ function were exchanged with the $\mathrm{PERMUTE}^3_0$ function, the respective inverse networks
14 would be obtained.
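
The condition that each digit appears in the least significant position exactly once can be checked mechanically. In the sketch below, which is an illustration and not part of the specification, a permutation function is encoded as a tuple of source digit indices listed from most significant to least significant, so that $x_2 x_0 x_1$ becomes `(2, 0, 1)`. This encoding and the names `apply_pattern`, `trace`, and `satisfies_condition` are assumptions made for the example.

```python
def apply_pattern(pattern, digits):
    """Apply one permutation function.

    digits  -- [x_{n-1}, ..., x_1, x_0], most significant first
    pattern -- source indices i of x_i for each output position,
               most significant first, e.g. (2, 0, 1) for x2 x0 x1
    """
    n = len(digits)
    return [digits[n - 1 - i] for i in pattern]


def trace(patterns, start):
    """Digits before the first Stage and after each successive permutation."""
    steps = [list(start)]
    for pattern in patterns:
        steps.append(apply_pattern(pattern, steps[-1]))
    return steps


def satisfies_condition(patterns):
    """True if every digit reaches the least significant position exactly once."""
    n = len(patterns) + 1
    start = list(range(n - 1, -1, -1))  # digit value equals its original index
    last_digits = [step[-1] for step in trace(patterns, start)]
    return sorted(last_digits) == list(range(n))


PERMUTE_3_0 = (2, 0, 1)      # x2 x0 x1
PERMUTE_3_1 = (1, 0, 2)      # x1 x0 x2
PERMUTE_3_1_ALT = (0, 1, 2)  # x0 x1 x2

if __name__ == "__main__":
    print(trace([PERMUTE_3_0, PERMUTE_3_1], [2, 1, 0]))
    print(trace([PERMUTE_3_0, PERMUTE_3_1_ALT], [2, 1, 0]))
    print(satisfies_condition([PERMUTE_3_0, PERMUTE_3_1]))
```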

The topology specified by $\mathrm{PERMUTE}^3_0$ and $\mathrm{PERMUTE}^3_1$ should be thought of as the virtual network 14
topology. Due to the mapping capabilities of the switch nodes 16, discussed further hereinafter, the physical
cabling will not necessarily match this topology. The network 14, however, behaves as though it does have
this topology.

In the preferred embodiment, it is also necessary to consider the topology of a network 14 of 4096 network
I/O ports 20 (n=4). This requires three permutation functions: $\mathrm{PERMUTE}^4_0\{x_3 x_2 x_1 x_0\} = x_3 x_2 x_0 x_1$,
$\mathrm{PERMUTE}^4_1\{x_3 x_2 x_1 x_0\} = x_1 x_0 x_3 x_2$, and $\mathrm{PERMUTE}^4_2\{x_3 x_2 x_1 x_0\} = x_3 x_2 x_0 x_1$. This sequence of permutation functions
maps octal $3210_8$ to $3201_8$ to $0132_8$ to $0123_8$. Again, notice that each digit appears in the least significant
position once. This set of functions is chosen because $\mathrm{PERMUTE}^4_0$ and $\mathrm{PERMUTE}^4_2$ leave the most
significant two digits undisturbed. The physical consequence of this is to minimize the cable length in those
two Stages. In the worst case, the distance between an output from one Stage to the input of the next Stage
can be no greater than 64 Levels. For example, examination of Figure 3 shows the worst case length to be
from Level 7 to Level 56. Note that a network 14 of 4096 network I/O ports 20 would contain 64 copies of Figure
3 in Stages 0 and 1 and another 64 copies would make up Stages 2 and 3. $\mathrm{PERMUTE}^4_1$ would specify the
interconnection between the two sets of 64 subnetworks.
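
The cable-length consequence of leaving the two most significant digits undisturbed can be verified numerically. The following sketch assumes Levels are held as ordinary integers and uses hypothetical function names (`permute_4_0`, `permute_4_1`, `permute_4_2`, `max_span`); it is an illustration of the arithmetic, not a description of the actual cabling procedure.

```python
def permute_4_0(level: int) -> int:
    """x3 x2 x1 x0 -> x3 x2 x0 x1: swap the two least significant octal digits."""
    hi, lo = divmod(level, 64)  # hi holds x3 x2, lo holds x1 x0
    x1, x0 = divmod(lo, 8)
    return hi * 64 + x0 * 8 + x1


def permute_4_1(level: int) -> int:
    """x3 x2 x1 x0 -> x1 x0 x3 x2: exchange the upper and lower digit pairs."""
    hi, lo = divmod(level, 64)
    return lo * 64 + hi


# The specification gives the same digit pattern for Stage 2 as for Stage 0.
permute_4_2 = permute_4_0


def max_span(permute, n_levels: int = 4096) -> int:
    """Largest Level distance between a Stage output and the input it feeds."""
    return max(abs(permute(x) - x) for x in range(n_levels))


if __name__ == "__main__":
    x = 0o3210
    for f in (permute_4_0, permute_4_1, permute_4_2):
        x = f(x)
        print(format(x, "04o"))
    print(max_span(permute_4_0), max_span(permute_4_1))
```

The span for the Stage 0 and Stage 2 functions stays inside a single 64-Level block, while the Stage 1 function is the one that reaches across blocks.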

3. SWITCH NODES

Figure 4 describes the components of an 8 x 8 switch node 16 according to the present invention. Figure
4 shows the basic circuitry required for communications from left to right through 9-bit forward channels 32,
and for receiving and transmitting, from right to left, serial replies through 1-bit back channels 34. To implement
a "folded" network 14, a duplicate but reversed 8x8 switch node 16 having the elements shown in Figure 4 is
required for communications from right to left through 9-bit forward channels 32, and for receiving and
transmitting, from left to right, serial replies through 1-bit back channels 34.

The organization of the switch node 16 is modular; there are eight identical copies of the input port logic
(IPL) 36 and eight identical copies of the output port logic (OPL) 38. Each switch node 16 is a crossbar so that
each input port can be connected to any of the output ports. Each input port receives a forward channel 32
comprising eight bits of parallel data and one bit of parity; each input port transmits a back channel 34
comprising one bit of serialized data. Each output port receives a back channel 34 comprising one bit of serialized
data; each output port transmits a forward channel 32 comprising eight bits of parallel data and one bit of parity.
Each IPL 36 is comprised of the following logic components, which are described further hereinafter: hard
carrier timer 44, input FIFO 46, command/data latch 48, tag latch 50, command decode 52, parity check 54,
input state control 56, output port select 58, data select mux 60, feedback select 62, command generator 64,
input status register 66, back channel mux 68, reply generator 70, port level register 72, back channel output
mux 74. Each OPL 38 is comprised of the following logic components, which are described further hereinafter:
hard carrier logic 84, hard carrier timer 86, output status register 92, parity check 94, output state control 96,
8-input arbiter 98, path select 100, output mux 102, output latch 104, command generator 106, reply decode
110, receive FIFO 112, back channel FIFO 114, clock select 116. In addition, the switch node 16 comprises
the following logic components, which are described further hereinafter: hard carrier timer generator 88, hard
carrier timeout value register 90, all out busy monitor 118, merge logic 120, diagnostic port logic (DPL) 122,
back channel interface 124, diagnostic port interface (DPI) 126, read/write control register 128, multicast port
select register 130, tag mapping table 108, and chip address register 121.

Within the IPL 36, the input state control 56 constantly monitors the input on the forward channel 32 for
the periodic presence of hard carriers, which indicates that the input port is connected to another switch node
16 or a TAXI transceiver 26. If the forward channel 32 input is directly interfaced to the TAXI transceiver 26,
the presence of a hard carrier is indicated by a strobe of a CSTRBI signal 42 generated by a TAXI transceiver
26. If the forward channel 32 input is directly interfaced to another switch node 16, the presence of a hard carrier
is indicated by the reception of a hard carrier escape code. Upon receipt of a hard carrier, a hard carrier timer
44 in the IPL 36 loads in two times the count value from a hard carrier timeout value register 90. The hard carrier
timer 44 then counts down and another hard carrier must be received prior to the counter reaching zero;
otherwise a hard carrier lost flag is set in the input status register 66. If the input port is not directly interfaced with
a TAXI transceiver 26, the hard carrier timer 44 for the back channel 34 is disabled.
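
The behaviour of the hard carrier timer can be modelled as a reloadable countdown. The sketch below is a software illustration under several assumptions that the text does not state: the class name `HardCarrierTimer`, the `tick` granularity, the choice to start the timer already loaded, and the latching of the lost flag are all hypothetical.

```python
class HardCarrierTimer:
    """Software model of a hard carrier watchdog (illustrative only)."""

    def __init__(self, timeout_value: int, enabled: bool = True):
        self.timeout_value = timeout_value  # stands in for timeout value register 90
        self.enabled = enabled
        self.carrier_lost = False           # stands in for the "hard carrier lost" flag
        # Assumption: the timer starts as if a hard carrier had just arrived.
        self.count = 2 * timeout_value

    def carrier_received(self) -> None:
        """On each hard carrier, reload with two times the timeout value."""
        self.count = 2 * self.timeout_value

    def tick(self) -> None:
        """Count down one step; set the lost flag if zero is reached first."""
        if not self.enabled or self.carrier_lost:
            return
        self.count -= 1
        if self.count == 0:
            self.carrier_lost = True


if __name__ == "__main__":
    timer = HardCarrierTimer(timeout_value=3)
    for _ in range(5):
        timer.tick()
    timer.carrier_received()  # carrier arrives in time, countdown restarts
    for _ in range(6):
        timer.tick()
    print(timer.carrier_lost)
```

Loading twice the timeout value means that a single late or missed hard carrier interval does not by itself raise the flag, provided the next one arrives before the doubled count expires.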

Within the OPL 38, the output state control 96 constantly monitors the input from the back channel 34 for
the periodic presence of a hard carrier whenever it is directly interfaced to a TAXI transceiver 26. The presence
of the carrier is indicated by a strobe of a CSTRBI signal 42 generated by the TAXI transceiver. Upon receipt
of a hard carrier, a hard carrier timer 86 in the OPL 38 loads in two times the count value from a hard carrier
timeout value register 90. The hard carrier timer 86 then counts down and another hard carrier must be received
prior to the counter reaching zero; otherwise a hard carrier lost flag is set in the output status register 92. If the
output port is not directly interfaced with a TAXI transceiver 26, the hard carrier timer 86 for the back channel
34 is disabled.

The OPL 38 also maintains the presence of a hard carrier on a forward channel 32 output. If there is no
circuit active, the OPL 38 generates a hard carrier every time it receives a signal from the hard carrier timer
generator 88, and upon reaching zero, the hard carrier timer generator 88 is reloaded from the hard carrier
timeout value register 90. If a circuit is established, the OPL 38 generates a hard carrier whenever the IPL 36 to
which it is connected receives a hard carrier. If the forward channel 32 output is directly interfaced to another
switch node 16, the hard carrier that is generated takes the form of a hard carrier escape code. If the forward
channel 32 output is directly interfaced to a TAXI transceiver 26, the hard carrier is generated by the TAXI transceiver
26 as a result of not receiving anything from the switch node 16 OPL 38 forward channel 32 for one cycle.

When no circuit is established or pending, the switch nodes 16 and sending controllers 18 always generate
a continuous stream of soft carrier commands. The controllers 18 and switch nodes 16 always expect to receive
the soft carrier when there is no circuit established or pending. If the soft carrier or another legal command is
not received immediately, a soft carrier loss error is reported by setting the appropriate bit of an input status
register 66.

When a circuit is connected, pending connect, or pending disconnect, switch nodes 16 and controllers 18
always expect to receive an idle command when nothing else is expected. If an idle command or another legal
command is not received, the forward channel loss bit or an idle loss error bit is set in the input status register
66.

4. NETWORK CONTROLLERS 

Figure 5 is a block diagram describing the components of the controllers 18 that connect each PM 12 to
the networks 14. A controller 18 comprises a SPARC™ microprocessor 56 controlling the transfer of data
through an input/output processor (IOP) 58. The IOP 58 communicates directly with a system bus 136
connected to the PM 12 and with the network 14 via phase locked TAXI transmitters 148 and receivers 150, and
an optical transceiver 22. The TAXI transmitters 148 and TAXI receivers 150 are used to serialize and
de-serialize data for transmission over optical fiber 24.

The controller 18 outputs a forward channel 32 consisting of eight bits of data plus a single bit parity, and
a one bit back channel 34 associated with the receive channel to the TAXI transmitter 148. The controller 18
receives a forward channel 32 consisting of eight bits of data plus a single bit of parity and a one bit back channel
34 associated with the transmit channel from the TAXI receiver 150. The TAXI transmitter 148 converts the 10
bits of parallel data into bit serial data that encodes clock information into the data stream. The TAXI receiver
150 converts the bit serial data back into 10 bits of parallel data and recovers the clock. Each TAXI transmitter
148 on the controller 18 derives its clock input from the clock output of the TAXI receiver 150 via the phase
locked loop 146. This allows each controller 18 to maintain synchronization to a master clock 28 distributed
via the network 14.
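
The 10-bit parallel word handed to the TAXI transmitter 148 (eight data bits, one parity bit, and one back
channel bit) can be illustrated with a short sketch. The bit positions and the use of odd parity are assumptions
made purely for illustration, since the text above specifies neither; the function names are likewise hypothetical.

```python
def parity_bit(data: int) -> int:
    """Parity bit for 8 data bits, using odd parity (the parity sense is an assumption)."""
    return (bin(data & 0xFF).count("1") + 1) % 2


def pack_word(data: int, back_channel: int) -> int:
    """Pack data, parity, and back channel bit into 10 bits (the layout is an assumption)."""
    data &= 0xFF
    return ((back_channel & 1) << 9) | (parity_bit(data) << 8) | data


def unpack_word(word: int):
    """Return (data, parity_ok, back_channel) for a 10-bit word."""
    data = word & 0xFF
    stored_parity = (word >> 8) & 1
    back_channel = (word >> 9) & 1
    return data, stored_parity == parity_bit(data), back_channel


word = pack_word(0xA5, back_channel=1)
print(unpack_word(word))
```

A single flipped data bit makes the stored parity disagree with the recomputed parity, which is the kind of
condition a receiver would flag as a parity error.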

5. DIAGNOSTIC PROCESSORS

As shown in Figure 5, every controller 18 (and boards in Figures 6, 7, and 8) is interfaced to a diagnostic
processor (DP) 140. There is one DP 140 per physical board that is interfaced to all the components on that
board. All the DPs 140 are interconnected using a local area network (LAN) 144. During system startup, the
DPs 140 have the ability to run self tests on the components and perform any initialization that is needed. During
normal operation, the DPs 140 can respond to error conditions and facilitate logging them. Those DPs 140 that
are interfaced to switch nodes 16 also participate in the process of reconfiguring the network 14 when errors
are detected. A switch node 16 may detect numerous faults including parity errors, hard carrier loss, data
overruns, back channel 34 loss, forward channel 32 loss, soft carrier loss, null loss, idle loss, FIFO errors, violation
errors, tag errors, command/reply errors, time outs, and merge errors.

Referring again to Figure 4, the diagnostic port interface (DPI) 126 in the diagnostic port logic (DPL) 122
of each switch node 16 allows the DP 140 to perform two types of activities within the switch node 16, i.e.,
reading and writing selected registers and sending information out any back channel 34 output. When the command
decode 52 in the IPL 36 detects the presence of a DP 140 command or datum, it stores the command in the
command/data and tag latches 48 and 50, and signals the DP 140 via the DPI 126. Using the DPI 126 and
read/write register 128, the DP 140 picks up the command. The DP 140 commands are always acknowledged
with a reply from the DP 140 which is returned via the back channel 34 output.

A forced parity error register is provided in each IPL 36 and each OPL 38. It is used for forcing parity errors
on a forward channel 32 in the OPL 38 or back channel 34 in the IPL 36. The DP 140 may read or write the
register. If a given forced parity error register is set to 00 when a test command or test reply is received, and
a circuit exists, then the command or reply is forwarded to the next switch node 16, but otherwise ignored. If
the register is set to 01 when a test command is received, and a circuit exists, then the test command is
forwarded to the next switch node 16 and the byte which immediately follows has its parity bit inverted before being
forwarded to the next switch node 16 (however, the forwarding switch node 16 does not report an error). If the
register is set to 01 when a test reply is received and a circuit exists, then the test reply is "backwarded" to the
previous switch node 16 with its first parity bit inverted (however, the "backwarding" switch node 16 does not
report an error). In either case, the register is then cleared to zero. If the register is set to 10, then the behaviors
are the same as the 01 case, except that the parity is inverted continuously as long as the register is set to 10,
and the register is not automatically cleared to 00.
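
The three register settings described above can be summarized in a small sketch. The function name is
hypothetical, and two points are assumptions because the text does not cover them: nothing is done when no
circuit exists, and the value 11 is rejected.

```python
def apply_forced_parity(register: int, circuit_exists: bool):
    """Return (invert_parity, new_register) for one test command or test reply."""
    if not circuit_exists:
        return False, register  # assumption: no action without a circuit
    if register == 0b00:
        return False, 0b00      # forwarded, otherwise ignored
    if register == 0b01:
        return True, 0b00       # invert once, then the register clears itself
    if register == 0b10:
        return True, 0b10       # invert continuously; register is not cleared
    raise ValueError("register value 11 is not described in the text")


print(apply_forced_parity(0b01, circuit_exists=True))
```

The one-shot setting (01) lets a diagnostic processor inject exactly one bad parity bit, while the continuous
setting (10) keeps injecting errors until the register is rewritten.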

6. PACKAGING 

In the preferred embodiment, each network 14 is constructed using up to four different boards, i.e., Type-A,
-B, -C, and -D boards. Type-A and -D boards are used if the network 14 contains between 2 and 64 network
I/O ports 20; Type-A, -B, and -D boards are used if the network 14 contains between 65 and 512 network I/O
ports 20; and Type-A, -C, and -D boards are used if the network 14 contains between 513 and 4096 network
I/O ports 20.
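
The size ranges above amount to a simple lookup, sketched here for illustration. The function name and the
choice to raise an error outside the range of 2 to 4096 ports are assumptions.

```python
def board_types(n_ports: int) -> tuple:
    """Return the board types used for a network with n_ports network I/O ports."""
    if not 2 <= n_ports <= 4096:
        raise ValueError("the text covers networks of 2 to 4096 network I/O ports")
    if n_ports <= 64:
        return ("A", "D")
    if n_ports <= 512:
        return ("A", "B", "D")
    return ("A", "C", "D")


for n in (64, 65, 512, 513):
    print(n, board_types(n))
```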

7. TYPE-A BOARD

Figure 6 describes a Type-A board 170. As described hereinbefore, the network 14 is physically folded
and the switch nodes 16 are paired so that a "left" switch node 16 in a specific stage and level is physically
adjacent to a "right" switch node 16 in the same stage and level. Each Type-A board 170 contains one such
stage 0 switch node 16 pair and one such stage 1 switch node 16 pair. Consequently, eight properly connected
Type-A boards 170 form a network 14 having 64 network I/O ports 20.

Up to eight PMs 12 may connect via controllers 18 to optical transceivers 22 on each Type-A board 170.
The optical transceivers 22 communicate, via TAXI transceivers 148 and 150, with the eight input ports of a
first 8x8 switch node 16 in stage 0. Each of the output ports from the first stage 0 switch node 16 communicates
with the input ports of a first stage 1 switch node 16. Up to eight Type-A boards 170 cross-connect between
the first stage 0 switch nodes 16 and the first stage 1 switch nodes 16, in a manner described in Figure 3, via
a backplane (not shown). The first stage 1 switch node 16 connects to TAXI transceivers 148 and 150 which
either loop back (at the bounce-back point 30) to connect to adjacent TAXI transceivers 148 and 150 in a
network 14 with 64 or fewer network I/O ports 20, or connect to a Type-B board 172 (discussed below) in a network
14 having between 65 and 512 network I/O ports 20, or connect to a Type-C board 174 (discussed below) in
a network 14 having between 513 and 4096 network I/O ports 20. The TAXI transceivers 148 and 150 connect
to the input ports of a second stage 1 switch node 16. The output ports of the second stage 1 switch node 16
connect to the input ports of a second stage 0 switch node 16. Up to eight Type-A boards 170 cross-connect
between the second stage 1 switch nodes 16 and the second stage 0 switch nodes 16, in a manner described
in Figure 3, via the backplane. The output ports of the second stage 0 switch node 16 connect to the optical
transceivers 22, via TAXI transceivers 148 and 150, and thus to the eight PMs 12.

Note that when interfacing to a TAXI transceiver 148 and 150, output port i from the switch node 16 handling
left to right paths is paired with input port i from the switch node 16 handling right to left paths, and vice versa.
(For the sake of brevity and clarity, however, Figure 6 shows only the back channel connections, as dotted lines,
from the TAXI transmitter 148 at the bottom of Figure 6 to the seventh input port on the #1 switch node 16 and
from the seventh output port on the #2 switch node 16 to the TAXI receiver 150 on the bottom of Figure 6.)
Thus, any one of the PMs 12 can connect to another of the PMs 12 by appropriate switching of the stage 0
and stage 1 switch nodes 16.

8. TYPE-B BOARD 

Figure 7 describes a Type-B board 172. Each Type-B board 172 contains two switch node 16 pairs. The
switch node 16 pairs are in stage 2 of any network 14 with more than 64 network I/O ports 20. These switch
nodes 16 are on either side of the bounce-back point 30 and thus represent the point at which data "bounces
back", "turns around", or reverses direction in the folded network 14. In networks 14 supporting between 65
and 512 network I/O ports 20, the stage 1 switch nodes 16 on the Type-A boards 170 are interconnected with
the stage 2 switch nodes 16 on the Type-B boards 172 to effect an expansion of the network 14. Thus, any one
of the PMs 12 can connect to another of the PMs 12 by appropriate switching of the stage 0, stage 1, and stage
2 switch nodes 16.

9. TYPE-C BOARD 

Figure 8 describes a Type-C board 174. For a system 10 supporting between 513 and 4096 network I/O
ports 20, an additional stage of switch nodes 16 (stage 3) is required, with the switch nodes 16 in stage 3
communicating with the switch nodes 16 of stage 2. Both stage 2 and stage 3 switch nodes 16 are implemented
on the Type-C board 174. The switch nodes 16 labeled as #1 - #4 are in stage 2 of the network 14; switch nodes
16 labeled as #5 - #8 are in stage 3 of the network 14.
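
The size boundaries at 64, 512, and 4096 network I/O ports 20 follow from the ceiling of log base b of N with
b = 8 ports per switch node 16. The sketch below computes that ceiling with integer arithmetic. The function
name is hypothetical, and the result is the number of stages in one direction of the folded network 14; it
ignores the fact that a Type-A board 170 always carries both stage 0 and stage 1.

```python
def ceil_log(n_ports: int, b: int = 8) -> int:
    """Smallest integer s such that b**s >= n_ports (integer-only ceiling of log_b N)."""
    stages = 0
    capacity = 1
    while capacity < n_ports:
        capacity *= b
        stages += 1
    return stages


for n in (64, 65, 512, 513, 4096):
    print(n, ceil_log(n))
```

With b = 8, two stages suffice up to 64 ports, a third (stage 2) is needed from 65 to 512 ports, and a fourth
(stage 3) from 513 to 4096 ports.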

The input ports of a first stage 2 switch node 16 connect to Type-D boards 176 via TAXI transceivers 148
and 150. Each of the output ports from the first stage 2 switch node 16 communicates with the input ports of
a first stage 3 switch node 16. Up to four Type-C boards 174 cross-connect between the first stage 2 switch
nodes 16 and the first stage 3 switch nodes 16, in a manner described in Figure 3, via a backplane (not shown).
The first stage 3 switch node 16 loops back (at the bounce-back point 30) to connect to the input ports of a
second stage 3 switch node 16. The output ports of the second stage 3 switch node 16 connect to the input
ports of a second stage 2 switch node 16. Up to four Type-C boards 174 cross-connect between the second
stage 3 switch nodes 16 and the second stage 2 switch nodes 16, in a manner described in Figure 3, via the
backplane. The output ports of the second stage 2 switch node 16 connect to Type-D boards 176 via TAXI
transceivers 148 and 150. Note that when interfacing to a TAXI transceiver 148 and 150, output port i from the
switch node 16 handling left to right paths is paired with input port i from the switch node 16 handling right to
left paths, and vice versa. (For the sake of brevity and clarity, however, Figure 8 shows only the back channel
connections, as dotted lines, from the TAXI transmitter 148 at the bottom of Figure 8 to the seventh input port
on the #3 switch node 16 and from the seventh output port on the #4 switch node 16 to the TAXI receiver 150
on the bottom of Figure 8.)

10. COMMUNICATION MODULE ASSEMBLY 

Each cabinet housing the components of the network 14 contains up to six Communication Module
Assemblies (CMAs). The packaging of components within the CMAs is intended to minimize configuration errors
and simplify manufacturing and field upgrading. There are three types of CMAs, i.e., CMA/A, CMA/B, and
CMA/C, depending on the size of the network 14: the CMA/A type is used in networks 14 supporting between
2 and 64 network I/O ports 20; the CMA/A and CMA/B types are used in networks 14 supporting between 65
and 512 network I/O ports 20; and the CMA/A and CMA/C types are used in networks 14 supporting between
513 and 4096 network I/O ports 20.

Figure 9 illustrates a network 14 comprising a single CMA/A 182, which supports between 2 and 64 network
I/O ports 20. The CMA/A 182 contains a power board, up to 8 Type-A boards 170, and 2 Type-D boards 176.
The Type-A boards 170 and Type-D boards 176 are arranged in two groups of five boards each. In each group,
the first two slots hold Type-A boards 170, the next slot holds a Type-D board 176, and the remaining two slots
hold Type-A boards 170. The UWP between stage 0 and stage 1 switch nodes 16 is embedded in a backplane 180.

The Type-D board 176 in the CMA/A 182 interconnects up to four Type-A boards 170 in a CMA/A 182 to 
up to four Type-B boards 172 in a CMA/B 184. The rationale behind the Type-D board 176 is that there is no 
room for electrical connectors on the front panels of Type-A boards 170 to carry the signals from the Type-A 
boards 170 in the CMA/A 182 to Type-B boards 172 in a CMA/B 184. Therefore, the Type-D board holds four 
connectors on its front and the board is used only as a repeater of high speed TAXI signals. There can be up 
to two Type-D boards in a CMA/A 182 to service eight Type-A boards 170 in the CMA/A 182. 

Figure 10 describes circuit switching within the CMA/A 182 and illustrates the Type-A board 170 connec- 
tions to the backplane 180 and the PMs 12. In the preferred embodiment, all the stage 0 to stage 1 intercon- 
nections are between Type-A boards 170 residing in the same CMA/A 182, so the interconnection pattern, i.e., 
the UWP, between the stages is embedded in a backplane 180. 

Within the Type-A boards 170, the bounce-back point 30 is created by connecting each of the eight TAXI 
transmitters 148 to the corresponding TAXI receivers 150 (see also, Figure 6). Note that for a network 14 of 
this size, as an option, a non-expandable Type-A board 170 could be used with the following modifications to 
the board shown in Figure 6: (1) the output TAXI transceivers 148 and 150 on the right side of Figure 6 would 
be eliminated; and (2) the outputs from the switch node 16 labeled as #3 would be connected directly to the 
inputs to the switch node 16 labeled as #4. Doing this would substantially lower the power consumption (by 
approximately 1/3) and cost of the Type-A board 170. The main drawback is having an additional board type. 
However, this configuration could be expected to meet the needs of many systems. 

Figure 11 illustrates a network 14 having CMA/As 182 and CMA/Bs 184, which support between 65 and
512 network I/O ports 20. Each CMA/B 184 houses eleven slots containing a power board, two dummy slots,
and two groups of four Type-B boards 172. For networks 14 supporting between 65 and 512 network I/O ports
20, each fully configured CMA/A 182 requires connection to one group in a CMA/B 184, i.e., every Type-B board
172 can connect to two Type-A boards 170. For networks 14 supporting 64 or fewer network I/O ports 20, no
CMA/B 184 is required. In the preferred embodiment, the stage 1 to stage 2 interconnection pattern, i.e., the
UWP, is embedded in a backplane 180 in the CMA/B 184. (Two backplanes 180 are shown in Figure 11 because
each group of four Type-B boards uses a different backplane.)

Figure 12 illustrates a network 14 having CMA/As 182 and CMA/Cs 186, which support between 513 and
4096 network I/O ports 20. Each CMA/C 186 houses a power board, two dummy boards, and up to two groups
comprised of four Type-C boards 174. For networks 14 supporting between 513 and 4096 network I/O ports
20, each fully configured CMA/A 182 requires connection to one group in a CMA/C 186, i.e., every Type-C board
174 can connect to two Type-A boards 170. In the preferred embodiment, all the stage 2 to stage 3
interconnections are between Type-C boards 174 residing in the same CMA/C 186, so the interconnection pattern, i.e.,
the UWP, between the stages is embedded in a backplane 180. (Two backplanes 180 are shown in Figure 12
because each group of four Type-C boards uses a different backplane.)

11. SIMPLIFIED CABLING 

In the present invention, simplified cabling is intended to minimize configuration errors and simplify
manufacturing and field upgrading. It is desirable to manufacture cables with a minimum number of different lengths,
which relies on the ability to plug a cable into any of several connectors in a CMA. Without this capability, a
given cable might not reach a specific connector in the specified CMA, although there are some connectors in
that CMA it does reach. With this capability, the cable can be plugged into a connector that it does reach. In
the field, connectors can be moved as needed for routing convenience. Thus, field engineers do not have to
deal with as many configuration errors.

In the present invention, signal wires are grouped into multiconductor cables so that the number of cables 
that have to be handled is minimized. Cables within the network 14 can be plugged into almost any available 
connector in a chassis with minimal constraints. There are only two constraints on how to install cables: (1) 
two ends of the same cable cannot be plugged into the same board type; and (2) each cable end is constrained 
only as to which of several CMA/As 182 or CMA/Bs 184 (which group in the case of a CMA/B 184) it is con- 
nected. The cable may be plugged into any available connector in the correct CMA/A 182 or CMA/B 184, i.e., 
any of the four connectors on either Type-D board 176 in a CMA/A 182 or either connector on any of the four 
Type-B boards 172 in either group of a CMA/B 184. However, a connector on the Type-D board 176 is not con- 
sidered available unless the slot to which it is wired contains a Type-A board 170. Unavailable connectors may 
be capped in manufacturing. 

Figure 13 (a) illustrates a cable harness assembly 178, wherein each cluster of eight cables labeled with 
a letter (A through R) plugs into one bidirectional switch node 16 pair. Connectors A through H connect to switch 
nodes 16 on Type-A boards 170 (through the Type-D board 176) and J through R connect to switch nodes 16 
on Type-B boards 172. Figure 13 (b) provides a simplified representation of the cable harness assembly 178
of Figure 13 (a).

Due to limited space for cable routing within a cabinet and the complexity of the cable harness assembly
178, it is preferable to avoid manufacturing a cable harness assembly 178 which is physically constructed as
shown. Hence, the cabling is implemented as follows.

For a network 14 with at least 65 but no more than 512 network I/O ports 20, one type of cable harness
assembly 178 with variations in length is used. This cable harness assembly 178 is illustrated in Figure 14 and
is equivalent to the cable harness assembly 178 shown in Figures 13 (a) and (b). The cable harness assembly
178 comprises eight bundles, labeled A-H, wherein each bundle has eight pairs of coaxial cable. The cross
connections are embedded in the backplane 180 to which the Type-B boards 172 are attached. The two
connectors attached to the front panel of Type-B boards 172 are wired directly to the backplane 180 where they
are distributed to the appropriate stage 2 switch nodes 16. The net result is as though the cable harness
assembly 178 of Figures 13 (a) and (b) is used and each of its connectors, J through R, are directly connected
to the TAXI transceivers 148 and 150 of a bidirectional switch node 16 pair on a Type-B board 172 instead of
being routed through the backplane 180.

As additional network I/O ports 20 are added, only an approximately proportional amount of hardware is
added, in most cases. Thus, the network 14 may be expanded in small increments while maintaining
performance, in contrast to prior art networks 14 which require large increments of hardware to be added to maintain
bandwidth when certain size boundaries are crossed, e.g., N = b^i + 1, wherein N is the number of network I/O
ports 20, b is the number of switch node 16 I/O ports, and i = 1, 2, etc.
The cabling of networks 14 with more than 64 network I/O ports 20 allows for graceful expansion as the
number of network I/O ports 20 is increased. The number of additional boards is kept to a minimum. As
additional network I/O ports 20 are added to a network 14, the need to add Type-A boards 170 is determined by
such factors as: (1) the number of Stage 0 to Stage 1 paths available by virtue of the Type-A boards 170 already
present; (2) the percentage of the maximum possible bandwidth desired; (3) the number of optical transceivers
22 needed to physically connect all PMs 12; and (4) the number of CMAs that must be cross-linked.

As a network 14 grows from N = 1 to N = 512, either no additional hardware is required when a processor 
is added (the majority of the cases, i.e., 448 out of 512), or there is a linear increase of up to one additional 
resource of each type (57 out of 512 cases), or there is a discontinuity with more than linear growth (7 out of 
512 cases). 

The seven discontinuities are shown in Table I. The increment from 64 -> 65 is the worst case
percentage-wise, because that marks the transition from two stages to three stages. At all remaining discontinuities, the
percentage increase is never greater than 12.5% (1/8th) beyond linear. There is no compounding effect due
to the discontinuities in that, once a discontinuity is crossed, as N grows, no additional hardware is added at
all until the linear growth relationship is restored, i.e., N "catches up" to the number of Type-A boards 170 or
Type-B boards 172. This is illustrated in Table I where the ratios of numbers before the discontinuity are always
perfectly linear, but not after. For example, in the "Type-A" column, X_A -> Y_A is the change shown and,
correspondingly, in the "N" column, X_N -> Y_N. Therefore, X_A / X_N is always 1/8th, which is perfect because
one Type-A board 170 can accommodate eight network I/O ports 20.

The minimum percentage of maximum possible bandwidth in a network 14 may be arbitrarily set to 50%.
In order to maintain this bandwidth, the following formulae are used to calculate the number of CMA/As 182
(#CMA/A), CMA/Bs 184 (#CMA/B), Type-A boards 170 (#A), Type-B boards 172 (#B), and Type-D boards 176
(#D):

#CMA/A = ⌈N / 64⌉

#A = MAX(⌈N / 8⌉, (8 * ⌊(N - 1) / 64⌋ +
     MAX(⌈(N MOD 64) / 8⌉, ⌈SQRT((N MOD 64) / 2)⌉,
     ⌈N / 128⌉ * (N > 64))))

#B = (N > 64) * 4 * ⌈N / 128⌉

#D = (#CMA/A + MIN(#CMA/A - 1, (2 * #B - MIN(4, #A - 8
     * (#CMA/A - 1))) MOD (4 * (#CMA/A - 1))))

#CMA/B = ⌈#B / 8⌉

wherein MAX is a maximum function, MIN is a minimum function, ⌈ ⌉ is a ceiling function, ⌊ ⌋ is a floor function,
MOD is an integer remainder, SQRT is a square root, and > is a boolean "greater than" function.
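
As a hedged illustration, the 50% bandwidth formulae for #CMA/A, #A, #B, and #CMA/B can be evaluated as
follows. The function names are hypothetical. The #D formula is left out of this sketch because, as written, its
MOD divisor 4 * (#CMA/A - 1) is zero when there is a single CMA/A, and the text does not say how that case is
handled.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def boards_50(n: int):
    """Return (#CMA/A, #A, #B, #CMA/B) for n network I/O ports at 50% bandwidth."""
    rem = n % 64
    cma_a = ceil_div(n, 64)
    a = max(
        ceil_div(n, 8),
        8 * ((n - 1) // 64)
        + max(
            ceil_div(rem, 8),
            math.ceil(math.sqrt(rem / 2)),
            ceil_div(n, 128) * (n > 64),  # boolean acts as 0 or 1, as in the text
        ),
    )
    b = (n > 64) * 4 * ceil_div(n, 128)
    cma_b = ceil_div(b, 8)
    return cma_a, a, b, cma_b


for n in (18, 19, 64, 65, 512):
    print(n, boards_50(n))
```

Evaluating N = 64 and N = 65 side by side shows the jump at the two-stage to three-stage boundary: a second
CMA/A, one more Type-A board, and the first four Type-B boards all appear at once.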
To configure a system 10 for N PMs 12 such that 100% of the maximum possible bandwidth is available,
the following formulae are used to determine the number of CMA/As 182 (#CMA/A), CMA/Bs 184 (#CMA/B),
Type-A boards 170 (#A), Type-B boards 172 (#B), and Type-D boards 176 (#D) that are required:

#CMA/A = ⌈N / 64⌉

#A = MAX(⌈N / 8⌉, (8 * ⌊(N - 1) / 64⌋ +
     MAX(⌈SQRT(N MOD 64)⌉, ⌈N / 64⌉ * (N > 64))))

#B = (N > 64) * MAX(⌈#A / 2⌉, 4 * ⌊(N - 1) / 64⌋ +
     ⌈N / 128⌉)

#D = ⌈#A / 4⌉

#CMA/B = ⌈#B / 8⌉
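
The 100% bandwidth formulae can be evaluated the same way. This is a sketch with hypothetical function names,
not a definitive configuration tool; it transcribes the five formulae above directly.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def boards_100(n: int):
    """Return (#CMA/A, #A, #B, #D, #CMA/B) for n network I/O ports at 100% bandwidth."""
    cma_a = ceil_div(n, 64)
    a = max(
        ceil_div(n, 8),
        8 * ((n - 1) // 64)
        + max(math.ceil(math.sqrt(n % 64)), ceil_div(n, 64) * (n > 64)),
    )
    b = (n > 64) * max(ceil_div(a, 2), 4 * ((n - 1) // 64) + ceil_div(n, 128))
    d = ceil_div(a, 4)
    cma_b = ceil_div(b, 8)
    return cma_a, a, b, d, cma_b


for n in (64, 65, 128, 512):
    print(n, boards_100(n))
```

For N = 512 the sketch yields eight CMA/As, 64 Type-A boards, 32 Type-B boards, 16 Type-D boards, and four
CMA/Bs.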

Table II shows an example of the number of Type-A boards 170 needed versus the number of PMs 12 for
a network 14 with up to 64 network I/O ports 20 if only 50% of the maximum possible bandwidth is required.
For up to 32 network I/O ports 20, the number of PMs 12 accommodated is determined by counting the number
of connections between the switch nodes 16 on the number of boards indicated. Beyond 32 network I/O ports
20, the number of boards required is strictly determined by the number of optical transceivers 22 required to
accommodate that number of PMs 12.

Table III shows an example of the number of Type-A boards 170 to install in the least populated CMA/A
182 given the number of PMs 12 to be connected to the depopulated CMA/A 182. This assumes 100% of the
maximum possible bandwidth is to be provided. In this case, the number of boards required is always limited
by the number of connections available between Stage 0 and Stage 1 switch nodes 16. In a network 14 with
more than 64 PMs 12, a Type-B board 172 is provided for every two Type-A boards 170. However, there must
be at least as many Type-B boards 172 as there are CMA/As 182, so extra boards may have to be added. In
most cases, if any additional hardware is required, the addition of a single PM 12 to the network 14 may require
the addition of one Type-A board 170, and one Type-B board 172 per network 14. If the current number of PMs
12 is a multiple of 64, then the addition of a single PM 12 requires two to four additional Type-B boards 172,
possibly an additional CMA/B 184 chassis, an additional CMA/A 182 chassis, 2 additional Type-D boards 176,
and one additional Type-A board 170 for every group of four Type-B boards 172 (maximum of eight). On
average, however, the number of boards and CMAs required is directly proportional to the number of PMs 12.

In the #A formula above, for 100% bandwidth, as the network 14 grows from 1 to 512 network I/O ports
20, the term:

⌈N / 8⌉ (A)

makes sure there are enough network I/O ports 20 to plug PMs 12 into. This term handles the case where N
is 64x.

The term:

8 * ⌊(N - 1) / 64⌋ (B)

calculates the number of completely full CMA/As 182, as long as there is at least one more partially populated
one.

In the term:

⌈SQRT(N MOD 64)⌉ (C)

(N MOD 64) calculates the leftover part for the partially populated CMA/A 182 and the SQRT function accounts
for the cross-connect between stages 0 and 1. If this is larger than the second term (B), then we are assured
of being able to cross-connect all Type-B boards 172.

The term:

⌈N / 64⌉ (D)

makes sure there are enough Type-A boards 170 to cross-connect with Type-B boards 172. This is where the
overhead comes from.

The term:

(N > 64) (E)

assures that the (D) term is used only if N > 64.

To compare the results for the #A formula for both N = 64x and N = 64x + 1, 1 < x < 8, examine the following
derivation:

MAX(⌈(64x + 1) / 8⌉, (8 * ⌊((64x + 1) - 1) / 64⌋ +
    MAX(⌈SQRT((64x + 1) MOD 64)⌉, ⌈(64x + 1) / 64⌉))) -
MAX(⌈64x / 8⌉, (8 * ⌊(64x - 1) / 64⌋ +
    MAX(⌈SQRT(64x MOD 64)⌉, ⌈64x / 64⌉)))

= MAX((8x + 1), (8x + MAX(1, x + 1))) -
  MAX(8x, (8 * (x - 1) + MAX(0, x)))

= MAX((8x + 1), (8x + x + 1)) -
  MAX(8x, (8x - 8 + x))

= (9x + 1) - 8x

= x + 1



This is the number of Type-A boards 170 added in crossing over from N = 64x to N = 64x + 1. Since we
would expect to add 1 due to linear growth, the overhead is x. This percentage of the total is 100 * x / 8x = 1/8
* 100 = 12.5%. The overhead, x, comes from the term:

⌈N / 64⌉

for N = 64x + 1, which accounts for providing cross connections to the Type-B boards 172. The constant
overhead ratio is due to the fact that the number of extra boards grows as x, and networks 14 that are multiples of
64 in size, by definition grow as x. The 1/8th value is due to the fact that eight Type-A boards 170 are needed
for every 64 network I/O ports 20 provided, but only one extra Type-A board 170 is needed per 64 network I/O
ports 20 in the least populated CMA/A 182 to allow it to be connected to the Type-B boards 172.
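
The x + 1 result and the 12.5% overhead can be checked numerically. The sketch below uses hypothetical
function names, evaluates the 100% bandwidth #A formula on both sides of each 64x boundary, and also covers
x = 1.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def type_a_boards_100(n: int) -> int:
    """#A for n network I/O ports at 100% of the maximum possible bandwidth."""
    return max(
        ceil_div(n, 8),
        8 * ((n - 1) // 64)
        + max(math.ceil(math.sqrt(n % 64)), ceil_div(n, 64) * (n > 64)),
    )


def crossing_step(x: int) -> int:
    """Type-A boards added when growing from N = 64x to N = 64x + 1."""
    return type_a_boards_100(64 * x + 1) - type_a_boards_100(64 * x)


for x in range(1, 8):
    step = crossing_step(x)
    overhead = step - 1  # boards beyond the single board expected from linear growth
    print(x, step, 100 * overhead / (8 * x))
```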

If the above derivation were repeated for the remaining formulae, i.e., for the #CMA/A, #CMA/B, #B, and
#D formulae, as illustrated in Table I, none of the increases would exceed 12.5%. Those skilled in the art will
readily recognize how to derive the other formulae, based on the information given above.

In changing from one network 14 size to another, it may be necessary and/or desirable to completely
disconnect all of the intra-network 14 cables and reconnect them for the new configuration. For small networks
14 (relative to one with 512 network I/O ports 20), the changes will typically involve moving a small number of
cables from one board to another as will be illustrated below.

For networks 14 with at least 65 and no more than 512 network I/O ports 20, the eight connectors at one
end of the cable harness assembly 178 described above are attached to the corresponding eight connectors
on the four Type-B boards 172 in one group of a CMA/B 184. The eight connectors at the other end of the cable
harness assembly 178 are distributed evenly among CMA/As 182 that are fully populated with Type-A boards
170, and are attached to Type-D boards 176 within the selected CMA/As 182. Connectors that would be
allocated to a CMA/A 182 that is partially filled with Type-A boards 170 are evenly redistributed to CMA/As 182
that have all eight Type-A boards 170.

For networks 14 with at least 65 and no more than 512 network I/O ports 20, to provide at least 50% of the
maximum possible bandwidth, the number of cable harness assemblies used to interconnect X CMA/As 182
to ⌈X/4⌉ CMA/Bs 184 is X/2 if X is even and (X+1)/2 if X is odd, wherein ⌈X/4⌉ is a ceiling function providing the
smallest integer not less than X/4. Cable harness assemblies can be added one at a time until there are a total
of X cable harness assemblies, at which point 100% of the maximum possible bandwidth will be available.
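
The harness counts just described reduce to two ceilings, as this brief sketch shows. The function names are
hypothetical, and the sketch only restates the counts given in the paragraph above.

```python
def harness_range(x_cma_a: int):
    """Return (minimum, maximum) cable harness assemblies for X CMA/As.

    The minimum gives at least 50% of the maximum possible bandwidth and the
    maximum gives 100%.
    """
    minimum = (x_cma_a + 1) // 2  # X/2 if X is even, (X+1)/2 if X is odd
    return minimum, x_cma_a


def cma_b_for_50_percent(x_cma_a: int) -> int:
    """Ceiling of X/4: the CMA/B count paired with X CMA/As in the 50% case."""
    return -(-x_cma_a // 4)


for x in (2, 3, 8):
    print(x, harness_range(x), cma_b_for_50_percent(x))
```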

Figure 15 shows a simplified wiring diagram describing how the switch nodes 16 are connected in a network
14 having 128 network I/O ports 20. The CMAs are represented by the solid boxes. The left hand block
represents a CMA/A 182 with eight Type-A boards 170. The right hand block represents a CMA/B 184 with two groups
of four Type-B boards 172 each therein. Two cable harness assemblies are used to link the Type-A boards
170 in each CMA/A 182 to the Type-B boards 172 in the CMA/B 184.

Figures 16(a), (b), (c) and (d) provide simplified wiring diagrams describing the expansion from 64 PMs 12
to 65-128 PMs 12. In each case, each PM 12 gets at least 50% of the maximum possible bandwidth.

In Figure 16 (a), CMA/A #1 need only contain one Type-A board 170 and one Type-D board 176 and only
one connector from the CMA/A end of the cable harness assembly 178 is connected to the Type-D board 176.
The other seven connectors are attached to any seven of the eight available Type-D connectors in CMA/A #0.
Recall that the Type-A boards 170 comprise Stages 0 and 1 of the network 14, so all PMs 12 attached to CMA/A
#0 can establish paths to switch nodes 16 in Stage 1 to which a cable is attached. The switch nodes 16 in Stage
0 will automatically sense any Stage 1 switch nodes 16 that are unconnected and avoid trying to establish paths
through them. Note also that there would be up to 64 optical cables attached to the "left" side of each CMA/A
182 in Figure 16 (a) for connection to the PMs 12, although they are not explicitly shown.

Figure 16 (b) shows the cabling for the situation in which there are three to eight additional PMs 12 beyond
64. Two Type-A boards 170 are required in CMA/A #1 and each associated connector on the Type-D board
176 must have a cable harness assembly 178 attached to maintain a balanced bandwidth between CMA/A #0
and CMA/A #1. A connection is moved from CMA/A #0 to CMA/A #1 for each Type-A board 170 added until
there are at least four. At that point, the bandwidth is as evenly split as possible using one cable harness
assembly 178. Again, within each CMA/A 182, it does not matter to which of the eight possible connection points
four of the cable connectors are attached. It also does not matter which four of the cables in the cable harness
assembly 178 go to which CMA/A 182; they just have to be evenly divided to maintain uniform bandwidth. In
any event, the network 14 would still function correctly.

Figure 16 (c) shows the cabling for the situation in which there are 9-18 additional network I/O ports 20 
beyond 64. 

Figure 16(d) shows the cabling for the situation in which there are 19 - 78 additional network I/O ports 20
beyond 64.

Figure 17 shows the cabling for the situation in which there are 512 network I/O ports 20 in the network
14. Twelve CMAs are present comprising eight CMA/As 182 that are fully populated with eight Type-A boards
170 (and two Type-D boards 176), and four CMA/Bs 184 with each group populated with four Type-B boards
172. All of the CMAs are housed in two docked cabinets (not shown). Eight cable harness assemblies are used
to connect the CMA/As 182 to the CMA/Bs 184. The bandwidth of this network 14 can be reduced in increments
of 1/8th by depopulating Type-B boards 172 from any CMA/B 184, four at a time. For each set of four Type-B
boards 172, i.e., one group, removed from a CMA/B 184, the corresponding cable harness assembly 178 is
also eliminated. The main reason to depopulate would be to lower the cost of the network 14 without losing
functionality.

Figure 18 shows the cabling for the situation in which there are more than 512 network I/O ports 20 in the
network 14. To configure a network 14 with more than 512 PMs 12 requires the use of a Type-C board 174 in
place of the Type-B board 172 and a change in the way the cabling is implemented. Twelve CMAs are present
comprising eight CMA/As 182 that are fully populated with eight Type-A boards 170 (and two Type-D boards
176), and four CMA/Cs 186 with two groups that are populated with four Type-C boards 174. These CMAs are
housed in two docked cabinets (not shown). Functionally, it is necessary to use the cable harness assembly
178 of Figure 14 with the Type-C boards 174. A total of eight such cable harness assemblies are required to
connect the CMA/As 182 with the CMA/Cs 186 in Figure 17. For each set of four Type-C boards 174, i.e., one
group, removed from a CMA/C 186, the corresponding cable harness assembly 178 is also eliminated. The
main reason to depopulate would be to lower the cost of the network 14. Depopulating also reduces cabling.
The Universal Wiring Pattern is embodied by the cable harness assembly 178. To cross-connect the
docked cabinets, each cable harness assembly 178 is cut in the middle and attached to connectors 18. This
allows the cabinets to be connected via cable bundles 190 that contain parallel wires. The constraints on the
way in which the cable bundles 190 are connected between cabinets are similar to the intra-cabinet cabling
discussed earlier. The two rules are: (1) two ends of the same cable bundle 190 shall not be plugged into the
same connector types; and (2) the cable bundles 190 shall be uniformly distributed among all docked cabinets.
As a result, there is tremendous flexibility in the configurations and in the connections of the network 14.

Figure 19 shows the cabling for the situation in which there are 1024 network I/O ports 20 in the network
14. Each pair of docked cabinets 188 contains twelve CMAs. Eight CMA/As 182 are fully populated with eight
Type-A boards 170 (and two Type-D boards 176) each, and four CMA/Cs 186 with two groups are populated
with four Type-C boards 174. In this case, to balance the bandwidth, four cable bundles 190 each connect the
cabinets 188 to themselves and another eight cables cross-connect into each other. The configuration shown
is cabled for 100% of the maximum possible bandwidth. At the 50% level, the cable bundles 190 shown in dashed
lines would be removed as well as all Type-C boards 174 in the lower docked cabinet 188 pair labeled as
#1.

Figure 20 shows the largest possible configuration of 4096 network I/O ports 20 using eight pairs of docked
cabinets 188 to house the network 14. A total of 64 cable bundles 190 are needed in this case. The bandwidth
can be lowered by removing sets of Type-C boards 174, one docked cabinet 188 pair at a time. For each docked
cabinet 188 pair, eight cable bundles 190 are removed.

Notice that the lines representing the cable bundles 190 in Figure 20 form the Universal Wiring Pattern
(UWP). This is because there are 64 copies of the UWP used to connect stage 1 switch nodes 16 to stage 2
switch nodes 16, and the wires that form each cable bundle 190 have been chosen to be from the same location
in each of the 64 copies, i.e., it is as though the 64 UWPs were all stacked on top of each other.

Any configuration other than those illustrated can be readily constructed by following the minimal construction
rules outlined above. It is understood that the manufacturing, field service, and marketing organizations
may wish to impose additional rules for the sake of simplicity and/or minimizing the number of different
configurations. Of note, however, is the ability to configure any network 14 size using the smallest possible amount
of hardware that gets the job done. In particular, an entry level network 14 can be offered with two depopulated
CMA/As 182, which keeps the cost as low as possible.

12. SWITCH NODE ADDRESSING 

Referring again to Figure 4, each 8x8 switch node 16 has a 12 bit chip address register 121 that is used
for specifying the switch node 16 location in the network 14. This location, called the chip's address, is defined
as:

$$C = \{c_{11}\, c_{10}\, c_{9}\, c_{8}\, c_{7}\, c_{6}\, c_{5}\, c_{4}\, c_{3}\, c_{2}\, c_{1}\, c_{0}\}_2$$

The bit positions are defined in Table IV. At startup, the chip address register 121 is loaded from the DP 140. 

The Right/Left bit, c_11, distinguishes between switch nodes 16 that route traffic to the right from the PM 12
to the bounce-back point 30 in the folded network 14, versus switch nodes 16 that route traffic to the left from
the bounce-back point 30 in the folded network 14 to the PM 12. Bit c_11 is set to 0_2 for those switch nodes 16
with right arrows, #1 and #3, on Type-A boards 170 and Type-B boards 172 as shown in Figure 6 and Figure
7. Bit c_11 is set to 1_2 for those switch nodes 16 with left arrows, #2 and #4, on Type-A boards 170 and Type-B
boards 172 as shown in Figure 6 and Figure 7.

The Stage number, c_10 c_9, is 00_2 for those switch nodes 16 on Type-A boards 170 that connect to controllers
18. They are under the "Stage 0" label in Figure 6. Bits c_10 c_9 are 01_2 for those switch nodes 16 on Type-A boards
170 under the "Stage 1" label in Figure 6. On the Type-B board 172 shown in Figure 7, all four of the switch
nodes 16 have their c_10 c_9 bits set to 10_2.

Bits c_8...c_0 determine the switch node 16 Level number in the network 14. This number, appended at the
least significant end with a three bit switch node 16 port number, p_2 p_1 p_0, defines the Level of the network I/O
port 20 in the network 14, i.e., c_8...c_0 p_2 p_1 p_0.

Bits c_2 c_1 c_0 are derived for every switch node 16 on a Type-A board 170 from its slot location in the CMA/A
182. The locations are encoded in four dedicated pins per slot from the backplane 180. The encoding begins
with 0000_2 in the leftmost board slot (the power board) and ends with 1010_2 in the rightmost board slot. The
DP 140 translates these physical numbers into the logical three bit number, c_2 c_1 c_0, needed. After translation,
the leftmost Type-A board 170 slot is assigned 000_2. Each subsequent Type-A board 170 is assigned a number
which increases by 1 (skipping over Type-D slots) up to the rightmost Type-A board 170, which is 111_2.

Bits c_3 c_2 c_1 are derived for every switch node 16 on a Type-B board 172 from its slot location in the CMA/B
184. The locations are encoded with four dedicated pins per slot from the backplane 180. The encoding begins
with 0000_2 in the leftmost board slot (the power board) and ends with 1010_2 in the rightmost board slot. The
DP 140 translates these physical numbers into the logical three bit number, c_3 c_2 c_1, needed. After translation,
the leftmost Type-B board 172 is assigned 000_2. Each subsequent Type-B board 172 is assigned a number
which increases by 1 up to the rightmost Type-B board 172, which is 111_2.

Bit c_0 is 0 for the upper two switch nodes 16 on a Type-B board 172 and 1 for the lower two switch nodes
16.

For a CMA/A 182, bits c_5 c_4 c_3 are derived from the CMA's location in the cabinet 188. For a CMA/B 184
or CMA/C 186, bits c_5 c_4 are derived from the CMA's location in the cabinet 188. They are the same for all switch
nodes 16 on every board in the same CMA. The DP 140 derives these bits as described in the dynamic
configuration procedure, described hereinafter, and stores them into each switch node 16 to which it is connected.

Bits c_8 c_7 c_6 are derived from the most significant three bits of the four least significant bits of the cabinet
188 number. In a docked cabinet 188 pair, one cabinet 188 has an even number and the other has the
next larger number. The cabinet 188 number is determined during the dynamic configuration procedure by the
DP 140 in the power subsystem, i.e., the gateway DP (not shown). This number is distributed to all DPs 140
in the cabinet 188 by the LAN 144 interconnecting the DPs 140. Each DP 140 stores the number into each
switch node 16 on a board to which it is connected. For networks 14 with no more than 512 network I/O ports
20, in the case of a local area network (LAN 144) failure, these bits are set to 0. For networks 14 with no more than
512 network I/O ports 20, these bits are the same in every switch node 16 in the network 14. For networks 14
with more than 512 network I/O ports 20, the cabinets 188 containing one network 14 are numbered sequentially,
starting with an even number.
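
The field layout of the chip address register 121 can be illustrated with a short sketch. It assumes only what the text above states: bit 11 is the Right/Left bit, bits 10-9 are the Stage number, and bits 8-0 are the Level number. The names `ChipAddress`, `decode_chip_address` and `encode_chip_address` are hypothetical and are not taken from the specification.

```python
from typing import NamedTuple


class ChipAddress(NamedTuple):
    """Hypothetical view of the 12-bit chip address register."""
    right_left: int  # c11: 0 = routes right (toward bounce-back), 1 = routes left
    stage: int       # c10 c9: stage number
    level: int       # c8..c0: switch node Level number


def decode_chip_address(c: int) -> ChipAddress:
    """Split a 12-bit chip address into its three fields."""
    if not 0 <= c < (1 << 12):
        raise ValueError("chip address must fit in 12 bits")
    return ChipAddress((c >> 11) & 0x1, (c >> 9) & 0x3, c & 0x1FF)


def encode_chip_address(right_left: int, stage: int, level: int) -> int:
    """Pack the three fields back into a 12-bit chip address."""
    return ((right_left & 0x1) << 11) | ((stage & 0x3) << 9) | (level & 0x1FF)
```

For instance, a left-routing Stage 1 switch node at Level 5 would carry the address 1 01 000000101 in binary under this layout.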

13. AUTOMATIC PROCESSOR ID ASSIGNMENT 


Automatic processor identification assignment consists of the ability to plug a PM 12 into any available network
I/O port 20 and have it receive a unique port identifier from the network 14. Thus, each PM 12 in the network
14 can determine its address in either network 14 by simply asking the network 14. This means that it does
not matter where any given PM 12 is plugged into the network 14. This greatly simplifies network 14 installation.

The PM's address within a given network 14 is determined by the Level number of the network I/O port 20
to which it is connected in that network 14. The PM 12 determines its address in each network 14 by transmitting
a Send-Port-Addr command to the network 14. The switch node 16 that receives this command supplies the
network I/O port 20 address via the Escape reply with a Port-Addr-Is key and the address itself.

Bits c_8...c_0 determine the Level number of the switch node 16 in the network 14. This number, appended
at the low order end with a three bit switch node 16 port number, p_2 p_1 p_0, defines the Level of the network I/O
port 20: c_8...c_0 p_2 p_1 p_0. This is the address that is supplied to a PM 12 when it asks the network 14 where it is
attached.
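
As a minimal sketch of the address composition just described, the 9-bit switch node Level is shifted left by three bits and the 3-bit port number is placed in the low-order end, giving a 12-bit port address. The function name `port_level` is illustrative only.

```python
def port_level(switch_level: int, port: int) -> int:
    """Return c8..c0 p2 p1 p0: the switch node Level with the port number appended."""
    if not 0 <= switch_level < (1 << 9):
        raise ValueError("switch node Level must fit in 9 bits")
    if not 0 <= port < 8:
        raise ValueError("port number must fit in 3 bits")
    return (switch_level << 3) | port
```

Under this reading, a PM 12 attached to port 3 of the switch node at Level 5 would be told that its address is 43.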

14. DYNAMIC CONFIGURATION 


Figure 21 is a flow chart describing the steps required for configuring the network 14. Since cables connecting
the boards in the network 14 can be configured in relatively arbitrary ways, the network 14 automatically
determines how it is cabled and uses that configuration to establish the path between PMs 12. A protocol between
switch nodes 16 permits one switch node 16 to ask another switch node 16 at the other end of a back
channel 34 to transmit its chip address back via the forward channel 32. These chip addresses are used to
build the tag mapping tables 108, which ensure that routing tags can be correctly interpreted to establish
communication paths between PMs 12.

After a PM 12 is powered up, it performs a self test procedure to test the links of the network 14. It then
transmits a Send-Port-Addr command to the network 14 and waits for an Escape reply with a Port-Addr-Is key
on the back channel 34 which contains the 12-bit address for the PM 12 on the network 14.

If the state of the network 14 is "configuring", the PM 12 volunteers to perform the configuration task. A
local DP 140, i.e., a DP 140 on the Type-A board 170 connected to the PM 12, signals whether the PM 12 has
been accepted or rejected as the Master PM 12 (only one PM 12 per network 14 may be designated as a Master
PM 12). If it is rejected, the PM 12 disconnects from the DP 140 and waits to be notified that the configuration
is complete. If it is accepted, the Master PM 12 configures the network 14.

The configuration steps determine the topology of the network 14 and account for any switch nodes 16 or 
links that fail a self-test. The Master PM 12 constructs the tag mapping tables 108 that account for the topology. 
The network 14 is available for use once these tables 108 are reloaded in the switch nodes 16. 

At startup, each DP 140 fills in the chip address register 121 of each switch node 16 on its board, i.e., bits
c_11-c_0. All switch nodes 16, except switch nodes 16 in the "right" stage 0 connected to the controllers 18, activate
their forward channel 32 carriers after the DP 140 has enabled all output ports of the switch node 16 by setting
enable bits in each output status register 92. The DP 140 also enables the input ports of the switch node 16
by setting enable bits in each input status register 66.

Each input port of a switch node 16 is instructed by the DP 140 to test its back channel 34 by transmitting
an Escape Reply with a Send-Chip-Addr key. Each output port that receives the Send-Chip-Addr key on its
back channel 34 reads its chip address register 121 and sends the Chip-Addr-Is command out the forward channel
32. Receipt of the Chip-Addr-Is command by each input port on every switch node 16 that requested the
chip address constitutes a test of all forward and back channel links.

When the Chip-Addr-Is command is received by an input port of a switch node 16, the DP 140 stores the
address in RAM 142. The DP 140 builds a table with eight entries per switch node 16 that identifies where each
input port is connected. The DP 140 reads the input status register 66 of each input port on each switch node
16 and constructs an eight bit input enable vector for each switch node 16 that indicates which ports are receiving
a carrier. The DP 140 reads the output status register 92 of each output port on each switch node 16
and constructs an eight bit output enable vector for each switch node 16 that indicates which ports are receiving
a carrier. Collectively, this information, and the type and location of faults detected by DPs 140, represents the
raw topology of the network 14. The raw topology information is redundant by virtue of the fact that the network
14 is symmetric and folded.

The Master PM 12 gets the raw topology information from the DPs 140 via the LAN 144 interconnecting
the network DPs 140 and local DP 140 of the Master PM 12. A local DP 140 is that DP 140 on a Type-A board
170 which is connected to a stage 0 switch node 16 that is directly connected to the controller 18 of a PM 12.
The Master PM 12 sends the network 14 a DP Connect command and the local DP 140 returns the raw topology
information associated with its local switch nodes 16 to the Master PM 12. The local DP 140 then requests that
every other DP 140 in the network 14 transmit its raw topology information, so it can be passed to the Master
PM 12.

Once the Master PM 12 has received all the raw topology information, it calculates the tag mapping tables
108, multicast port select vectors, and input and output enable vectors for each switch node 16 in the network
14. The calculation includes a consistency check on the data and a validation check to make sure no cabling
rules have been violated. The information for the tag mapping tables 108 for each of the switch nodes 16 is
derived from the chip addresses, either of the switch node 16 in the next stage connected directly thereto, or
of the switch node 16 in the following stage. The tag mapping table 108 needs only $\lceil \log_2 b \rceil$ bits per entry rather
than $\lceil \log_2 N \rceil$ bits, e.g., 3 bits versus 12 bits.
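
The 3-bit versus 12-bit figure can be checked with a small calculation. The sketch below assumes b = 8 ports per switch node and N = 4096 network I/O ports (the largest configuration mentioned in this description); the helper name `ceil_log2` is hypothetical.

```python
def ceil_log2(x: int) -> int:
    """Smallest integer not less than log2(x), for x >= 1."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return (x - 1).bit_length()


B = 8      # ports per switch node
N = 4096   # network I/O ports in the largest configuration

bits_per_entry_local = ceil_log2(B)    # entry names one of b output ports
bits_per_entry_global = ceil_log2(N)   # entry would name one of N network ports
table_bits_local = B * bits_per_entry_local
table_bits_global = B * bits_per_entry_global
```

With eight entries per table, storing a local output port number needs 24 bits per table, whereas storing a full network port address would need 96 bits.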

If any faults are reported, the calculations simulate the removal of the faulty component by deleting the
appropriate entries in the raw topology information. For example, if a switch node 16 has failed, up to 16 links
may be deleted. The output enable vectors are set to disable output ports where the links have been removed
so that the load balancing logic will not select those ports. The tag mapping tables 108 also must not point to
a disabled output port or an error will be reported if a routing tag references the output port. Input ports are
disabled so that no spurious errors will be reported, i.e., the output ports they are connected to are disabled
and/or faulty, so they are either sending nothing, which is an error, or garbage, which has already been
diagnosed.
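
The simulated removal of a faulty switch node can be sketched as a filter over the link list followed by a recomputation of the enable vectors. The link representation used here (a pair of `(node, port)` endpoints) and the function names are assumptions made for illustration; the specification does not describe the data structures used by the Master PM 12.

```python
from typing import Set, Tuple

Endpoint = Tuple[str, int]          # (switch node name, port number 0..7)
Link = Tuple[Endpoint, Endpoint]    # (output endpoint, input endpoint)


def remove_faulty_node(links: Set[Link], faulty: str) -> Set[Link]:
    """Delete every link that starts or ends at the faulty switch node."""
    return {l for l in links if l[0][0] != faulty and l[1][0] != faulty}


def output_enable_vector(links: Set[Link], node: str) -> int:
    """Eight-bit vector with bit p set if output port p still has a link."""
    vector = 0
    for (src, port), _ in links:
        if src == node:
            vector |= 1 << port
    return vector


def input_enable_vector(links: Set[Link], node: str) -> int:
    """Eight-bit vector with bit p set if input port p still has a link."""
    vector = 0
    for _, (dst, port) in links:
        if dst == node:
            vector |= 1 << port
    return vector
```

Because an 8x8 switch node has at most eight input links and eight output links, the filter removes at most 16 entries, which matches the figure given above.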

If the failure is in a non-local switch node 16 or link (one not directly connected to a controller 18), the
redundant nature of the network 14 guarantees that the tag mapping tables 108, multicast port select vectors,
and input and output enable vectors can be computed with no loss of functionality, although there is a slight
decrease in the bandwidth of the network 14. It may not be possible to preserve functionality if there are multiple
failures, depending upon the specific combination of failures.

If there are one or more failures of local switch nodes 16 or links, the network 14 can be configured to be
functional for point-to-point communications only if the controllers 18 connected to the faulty components are
disabled. The network 14 cannot be used for broadcast or multicast. The other network 14 is used for that
purpose.

When the calculation of the tag mapping tables 108, multicast port select vectors, and input and output
enable vectors is complete, the Master PM 12 re-establishes connection with its local DP 140 and transfers
the tag mapping tables 108, multicast port select vectors, and input and output enable vectors in packages
grouped by switch node 16. As the local DP 140 receives each switch node 16 package, it transmits the package
to the appropriate DP 140. The process continues until all DPs 140 have received the packages for every switch
node 16.

When each DP 140 receives the package, it selects the correct switch node 16 and writes eight tag mapping
tables 108 into the output port select 58 in each IPL 36 of the switch node 16. The DP 140 then enables and
disables the eight input ports of the switch node 16 according to the selected eight bit input enable vector, one
bit per input status register 66; the DP 140 also enables and disables the eight output ports of the switch node
16 according to the selected eight bit output enable vector, one bit per output status register 92. The multicast
port select register 130 of the switch node 16 is also loaded with the correct multicast port select vector. Upon
completion of this task for each switch node 16, the DP 140 signals the local DP 140 with an acknowledgement.

When the local DP 140 determines that all switch nodes 16 have been configured, it signals the Master
PM 12 that the configuration is complete. The Master PM 12 then signals the local DP 140 to change the state
of the network 14 from "configuring" to "ready." The local DP 140 broadcasts the state change to all other DPs
140 via the LAN 144 connecting the DPs 140. The network 14 is then ready for use.

Any PMs 12 that query the local DP 140 for the current state of the network 14 will find out that it is ready
for use. At this point, all active PMs 12 execute a distributed algorithm to build their processor routing tables.
These routing tables comprise the list of active PMs 12 and their addresses in the network 14.

A PM 12 that has just initialized and determines that either or both networks 14 are in the ready state
notifies the other PMs 12 of its presence in the network 14. The PM 12 multicasts its network I/O port 20 address
on each network 14 to all other PMs 12. By merging replies using an addition mode, the PM 12 knows how
many PMs 12 have received the multicast. Each receiving PM 12 adds the network I/O port 20 address to its
table of PM 12 locations on the indicated network 14.

A flag is set to note if either network 14 is to be used for point-to-point traffic only. In such a case, some
PMs 12 are not included on the list for that network 14, but are on the list of the network 14 capable of performing
multicasts.

Each PM 12 transmits a point-to-point message to the sending PM 12 of the multicast indicating its I/O 
port address on each network 14. The sending PM 12 can then build its PM 12 routing tables from the point- 
to-point addresses received. Thus, an existing network 14 can be expanded online. 

Figure 22 is a flow chart describing the steps required for reconfiguring the network 14 when a fault occurs
therein. If a fault is detected, the DP 140 can request that the network 14 be reconfigured so that the fault can
be isolated. Communications in the faulty network 14 are interrupted during reconfiguration. However,
communications within the system 10 are not interrupted because there are two networks 14. The controllers 18
in each PM 12 automatically switch over to the operational network 14 until the reconfiguration is complete,
and then return to load balancing traffic between the two networks 14.

For the most part, the reconfiguration steps are similar to the steps performed at network 14 startup. The
difference is that the configuring Master PM 12 identifies the fault location from information received from the
switch nodes 16 and DPs 140.

In Figure 22, a continuous loop executes so long as there are any unprocessed faulty links or nodes. Within
the loop, faulty links and switch nodes 16 are processed according to their location on either side of the
bounce-back point 30.

For a faulty "left" link, i.e., a fault on a link between switch nodes 16 in the left half of an unfolded network
14, including links connected to the output of the last stage, the Master PM 12 traces back on the link and
disables the output port of the connected switch node 16. If this results in all the output ports on the connected
switch node 16 being disabled, then the connected switch node 16 is marked as being "faulty." The Master PM
12 also traces forward on the link and disables the input port of the connected switch node 16. If this results
in all the input ports on the connected switch node 16 being disabled, then the connected switch node 16 is
marked as being "faulty."

For a faulty "left" switch node 16, i.e., a fault in a switch node 16 in the left half of an unfolded network 14,
the Master PM 12 traces back on all input ports to the connected switch nodes 16 and disables the output
port of each connected switch node 16. If this results in all the output ports on a connected switch node 16
being disabled, then the connected switch node 16 is marked as being "faulty." The Master PM 12 also traces
forward on all the output ports to the connected switch nodes 16 and disables the input port of each connected
switch node 16. If this results in all the input ports on a connected switch node 16 being disabled, then the
connected switch node 16 is marked as being "faulty."

For a faulty "right" link, i.e., a fault on a link between switch nodes 16 in the right half of an unfolded network
14, the Master PM 12 traces back on the link and marks the connected switch node 16 as being "faulty."

For a faulty "right" switch node 16, i.e., a fault in a switch node 16 in the right half of an unfolded network
14, the Master PM 12 traces back on all input ports to the connected switch nodes 16, marks the connected
switch nodes 16 and links as being "faulty," and disables the output port of each connected switch node 16.
The Master PM 12 also traces forward on all output ports to the connected switch nodes 16, marks the
connected switch nodes 16 as being "faulty," and disables the input port of each connected switch node 16.

In addition to the above steps, attached controllers 18 are disabled if the faulty link or switch node 16 is in
stage 0. The PMs 12 associated with the disabled controllers 18 use the redundant network 14. All other PMs
12 are notified that the affected PMs 12 can be reached via the redundant network 14 only. In all cases, the
tag mapping tables 108 are updated so they do not point to any output port that has been disabled.
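
The loop of Figure 22 can be sketched as a worklist algorithm. The sketch below covers only the two "left" half rules (faulty link and faulty switch node), since those are stated precisely enough to model; the "right" half rules and the stage 0 controller handling are deliberately omitted. The `Link` type, the function name, and the decision to count only cabled ports when testing whether all ports of a switch node are disabled are assumptions, not details from the specification.

```python
from collections import deque
from typing import Deque, Iterable, List, NamedTuple, Set, Tuple


class Link(NamedTuple):
    src: str       # switch node driving the link
    src_port: int  # its output port
    dst: str       # switch node receiving the link
    dst_port: int  # its input port


def process_left_faults(links: List[Link],
                        faulty_links: Iterable[Link],
                        faulty_nodes: Iterable[str]):
    """Return (disabled output ports, disabled input ports, faulty nodes)."""
    disabled_out: Set[Tuple[str, int]] = set()
    disabled_in: Set[Tuple[str, int]] = set()
    marked: Set[str] = set(faulty_nodes)
    work: Deque[tuple] = deque(("link", l) for l in faulty_links)
    work.extend(("node", n) for n in marked)

    def disable(link: Link) -> None:
        # Trace back: disable the upstream output port.
        disabled_out.add((link.src, link.src_port))
        # Trace forward: disable the downstream input port.
        disabled_in.add((link.dst, link.dst_port))
        outs = {(l.src, l.src_port) for l in links if l.src == link.src}
        ins = {(l.dst, l.dst_port) for l in links if l.dst == link.dst}
        # A node with every cabled port on one side disabled is marked faulty.
        for node, ports, disabled in ((link.src, outs, disabled_out),
                                      (link.dst, ins, disabled_in)):
            if node not in marked and ports <= disabled:
                marked.add(node)
                work.append(("node", node))

    while work:  # loop until no unprocessed faulty links or nodes remain
        kind, item = work.popleft()
        if kind == "link":
            disable(item)
        else:
            for l in links:
                if l.src == item or l.dst == item:
                    disable(l)
    return disabled_out, disabled_in, marked
```

In this model a newly marked switch node is pushed back onto the worklist, which is one way to read the statement that the loop runs while any unprocessed faulty links or nodes remain.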

In the general case, where a MIN is constructed from b x b switch nodes 16 with n stages and $b^n = N$ I/O
ports 20, the loss in bandwidth is a function of the stage, i, in which the error occurred. The following equations
compute this:


Left Half
  - Link loss = $1 / b^{n} = b^{-n}$
  - Switch node loss = $b / b^{n} = b^{-n+1}$

Right Half
  - Link at the switch node output in stage $i$ = $b^{n-i} / b^{n} = b^{-i}$
  - Switch node loss in stage $i$ = $b^{n-i} / b^{n} = b^{-i}$


For example, in a network 14 of 512 I/O ports 20, there are three stages (n=3) constructed from 8x8 switch
nodes 16 (b=8). If a switch node 16 in the right half of the unfolded network 14 in stage 1 (the second stage)
fails, then 1/8th or 12.5% of the total bandwidth will be lost. If there were a loss in stage 0, then 100% of the
bandwidth is considered lost because eight ports have become inaccessible and the algorithm would not be
able to guarantee it can route all messages correctly. Hence, those ports and the attached controllers 18 are
disabled. Doing so restores 100% bandwidth to the remaining healthy, but smaller, network 14.
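
The worked example can be reproduced numerically. The sketch below assumes that a right-half switch node failure in stage i costs a fraction b^(-i) of the bandwidth, which is the only reading consistent with both figures quoted above (12.5% for stage 1 and 100% for stage 0 when b = 8). The left-half functions assume a loss of 1/b^n per link and b/b^n per switch node. All function names are hypothetical.

```python
from fractions import Fraction


def right_half_switch_loss(b: int, i: int) -> Fraction:
    """Fraction of bandwidth lost when a right-half switch node in stage i fails."""
    return Fraction(1, b ** i)


def left_half_link_loss(b: int, n: int) -> Fraction:
    """Fraction lost for one left-half link: one of the b**n links of a stage."""
    return Fraction(1, b ** n)


def left_half_switch_loss(b: int, n: int) -> Fraction:
    """Fraction lost for one left-half switch node: b of the b**n links."""
    return Fraction(b, b ** n)


# 512-port network: three stages of 8x8 switch nodes.
stage1_loss = right_half_switch_loss(8, 1)   # 1/8 of the bandwidth
stage0_loss = right_half_switch_loss(8, 0)   # all of the bandwidth
```

Exact fractions are used instead of floating point so that the results can be compared directly with the 1/8th figure in the text.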

If any part of a link between a PM 12 and the network 14 fails, the PM 12 uses only the active network 14.
Each PM 12 in the network 14 is notified whenever a given PM 12 is unavailable on a given network 14. In the
event that both links connected to a PM 12 fail, the PM 12 is dropped from the network 14. When a switch node
16 detects a carrier loss from a controller 18, it interrupts its DP 140 which determines the network I/O port 20
involved, and broadcasts the network I/O port 20 to the other DPs 140 via the LAN 144. The other DPs 140
notify their PMs 12 that the PM 12 at the network I/O port 20 given is no longer part of the network 14. The
receiving PMs 12 delete the entry in their processor routing tables corresponding to the network I/O port 20.
All PMs 12 switch to using the other network 14 for multicasts.



15. SIMPLIFIED BACKPLANE ROUTING 


Ordinarily, the layout of the network 14 backplane 180 would be very complex due to cross-connecting 128
twelve bit channels, i.e., two universal wiring patterns (UWP) between stages of switch nodes 16. The
cross-connect wiring needed to couple one stage of the network 14 to the next consists of 64 paths of 24 traces each
for a total of 1,536 traces. Each is a point-to-point connection, not a bus. These traces are located in a backplane
180, which creates a substantial layout problem.

However, due to the existence of the tag mapping tables 108 and the method of dynamic reconfiguration,
it is possible to remap the traces, so that the connections are horizontal only. The tag mapping tables 108 can
be used to compensate for traces that need to be moved to reduce congestion on the backplane 180.



16. CONNECTION PROTOCOL

The connection protocol between PMs 12 uses a handshaking scheme that sends commands and data
on the parallel 9-bit forward channel 32, and communicates replies on the serial 1-bit back channel 34. A
connect command, which contains a 16-bit routing tag specifying the destination, is transmitted from a sending
controller 18 to the network 14 on the forward channel 32. The connect command works its way from switch
node 16 to switch node 16 until it reaches a receiving controller 18.

Within each switch node 16, the IPL 36 uses a DSTRBI signal 40 generated by the TAXI transceiver 26 to
strobe a data item received on the forward channel 32 input into an input FIFO 46. A data item may be clocked
into the input FIFO 46 at every clock cycle. The input FIFO 46 synchronizes data items with an internal clock
of the switch node. The output of the input FIFO 46 is examined by a command decode 52, which in conjunction
with an input state control 56, determines where the data is to be latched, if anywhere.

When a connect command is received, the switch node 16 stores the connect command in a command/data
latch 48 and the routing tag in a tag latch 50. The four most significant bits of the routing tag specify one of
six methods by which the connect command, and thus the path, is routed: monocast (i.e., point-to-point)
non-blocking, monocast blocking, monocast blocking without load balancing, multicast, and DP connect. The other
bits of the routing tag are used either to steer the message to the proper PM 12 for monocast connect
commands, or to select a "bounce-back point" for multicast connect commands.

The routing tag is decoded by the output port select 58 to produce an eight bit vector which identifies the
desired output port in the case of a point-to-point connect command, or multiple ports in the case of a multicast
connect command. In the case of a multicast connect command, a multicast port select register 130 is examined
by the output port select 58 to generate directly the output port selection vector.

For monocast connect commands, the output port select 58 contains a tag mapping table 108 for mapping
the routing tag to the correct output port based on the way the network 14 boards are cabled. The tag mapping
table 108 is an 8 x 3-bit memory array used for translating the routing tag to an output port selection. The tag
mapping table 108 provides a logical to physical translation wherein three bits from the routing tag are fed to
the tag mapping table 108 and three bits are output therefrom identifying the particular output port of the switch
node 16 that should be used. The bits from the routing tag address an entry in the tag mapping table 108, the
contents of which identify the correct output port to receive the command. In the preferred embodiment, the
bits from the routing tag comprise c_0-c_2 for stage 0, c_3-c_5 for stage 1, c_6-c_8 for stage 2, and c_9-c_11 for stage 3.
(The stage is identified via a chip address register 121.)
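
The logical to physical translation performed by the tag mapping table 108 can be sketched as a table lookup indexed by a 3-bit field of the routing tag. The sketch assumes the field assignment read from the paragraph above (bits 0-2 for stage 0, bits 3-5 for stage 1, and so on) and models the table as a list of eight output port numbers; `tag_field` and `select_output_port` are hypothetical names.

```python
from typing import Sequence


def tag_field(routing_tag: int, stage: int) -> int:
    """Extract the three routing tag bits consumed by the given stage (0..3)."""
    if not 0 <= stage <= 3:
        raise ValueError("stage must be in 0..3")
    return (routing_tag >> (3 * stage)) & 0b111


def select_output_port(tag_map: Sequence[int], routing_tag: int, stage: int) -> int:
    """Look up the physical output port for a monocast connect command."""
    if len(tag_map) != 8 or any(not 0 <= p < 8 for p in tag_map):
        raise ValueError("tag mapping table must hold eight 3-bit entries")
    return tag_map[tag_field(routing_tag, stage)]


# A straight-through table, and one compensating for reversed cabling.
IDENTITY = [0, 1, 2, 3, 4, 5, 6, 7]
REVERSED = [7, 6, 5, 4, 3, 2, 1, 0]
```

With the same routing tag, two switch nodes holding different tables select different physical ports, which is how the table absorbs differences in cabling.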

Each bit of the vector generated by the output port select 58, along with control signals, is connected to
an 8-input arbiter 98 in the OPL 38. The 8-input arbiter 98 receives signals from each IPL 36, selects an active
IPL 36, and outputs the input port number selected. The input port number is used by a path select 100 to
connect the IPL 36 to an output latch 104 in the OPL 38 via the output mux 102.

Within the arbiter 98, multicast connect commands have priority over point-to-point connect commands.
Note, however, that no circuit either pending or currently established is ever preempted by a multicast. In the
case of point-to-point connect commands, the arbiter 98 selects an active IPL 36 in a round-robin fashion. The
8-input arbiter 98 only changes the input port selected after a disconnect command has been received.
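
The arbitration policy can be modelled in software as follows. This is a behavioural sketch only: the class name `Arbiter` is hypothetical, the use of round-robin order among competing multicast requests is an assumption (the text specifies round-robin only for point-to-point commands), and pending circuits are not modelled separately from established ones.

```python
from typing import Optional, Set


class Arbiter:
    """Behavioural model of an n-input arbiter for one output port."""

    def __init__(self, n: int = 8) -> None:
        self.n = n
        self.last = n - 1                    # so the first scan starts at port 0
        self.current: Optional[int] = None   # input port holding the circuit

    def select(self, monocast: Set[int], multicast: Set[int]) -> Optional[int]:
        """Return the selected input port, or None if nothing is requesting."""
        if self.current is not None:
            return self.current              # an established circuit is never preempted
        for pool in (multicast, monocast):   # multicast requests take priority
            for offset in range(1, self.n + 1):
                candidate = (self.last + offset) % self.n
                if candidate in pool:
                    self.current = self.last = candidate
                    return candidate
        return None

    def disconnect(self) -> None:
        """Release the circuit so that the next request can be selected."""
        self.current = None
```

The selection changes only after `disconnect` is called, mirroring the statement that the arbiter 98 changes the input port selected only after a disconnect command has been received.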

The output state control 96 decodes the input port number output from the arbiter 98 and sends an enable
signal to a feedback select 62 in the associated IPL 36. The feedback select 62 compares all incoming enables
(one from each OPL 38) to the original bit vector from the output port select 58 for an exact match. When an
exact match is detected, the input state control 56 is notified that the connect command currently stored in the
command/data and tag latches 48 and 50 has been selected for connection.

The input state control 56 activates an acknowledgement signal that is distributed by the output port select
58 to each output port selected. The acknowledgement signal confirms that all output ports needed are available
and therefore each port can proceed to transmit the stored connect command to the next stage switch node
16. Transmission is enabled by setting the output mux 102 to select the output of the path select 100 and
activating the clock to the output latch 104.

Once the circuit has been established through all switch nodes 16, the receiving controller 18 acknowledges
receipt of a connect command by sending "idle replies" on the back channel 34 to the sending controller 18.
When the sending controller 18 receives the first idle reply, the circuit is complete.

However, before the back channel 34 can transmit any replies, the IPL 36 in a switch node 16 connected
to the receiving controller 18 changes the status of the back channel 34 from "closed" to "open." A closed back
channel 34 continuously transmits null bits; an open back channel 34 continuously transmits idle replies (when
there is no other reply to transmit). The IPL 36 opens the back channel 34 with a start pattern comprising the
binary bits 110, which allows single bit error detection. Immediately following the start pattern, the receiving
controller 18 begins transmitting idle replies.

Within each switch node 16, the OPL 38 uses a BSTRBI signal 82 generated by the TAXI transceiver 26
to strobe a data item received on the back channel 34 input into a receive FIFO 112. A data item may be clocked
into the receive FIFO 112 at every clock cycle. The receive FIFO 112 is five bits deep and synchronizes data
items with the internal clock of the switch node 16. The OPL 38 removes a data item from the receive FIFO
112 at every clock cycle as well. A reply decode 110 decodes the bits from the receive FIFO 112. Nulls or start
patterns are discarded; replies are passed intact to a back channel FIFO 114.

The back channel FIFO 114 is capable of storing the longest single reply. A long reply begins with a four
bit reply code, a first parity bit covering the reply code, an eight bit key field, a second parity bit covering all
previous bits, the first 8 bits of a 16-bit data field, a third parity bit covering all previous bits, the second 8 bits
of the 16-bit data field, and a fourth parity bit that covers the entire reply. If the back channel FIFO 114 is full,
the reply decode 110 continues to decode incoming replies, but discards short replies, e.g., the idle reply.
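
The long reply layout above totals 4 + 1 + 8 + 1 + 8 + 1 + 8 + 1 = 32 bits. The following sketch encodes and checks such a reply. Several details are assumptions made only for illustration, since the text does not specify them: even parity, most significant bit first, the "first 8 bits" of the data field being the high-order byte, and each parity bit covering the earlier parity bits as well as the field bits. The function names are hypothetical.

```python
def parity(bits):
    """Even parity over a list of bits (assumption: the polarity is not given)."""
    return sum(bits) % 2


def to_bits(value, width):
    """Most-significant-bit-first list of `width` bits (assumed bit order)."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def encode_long_reply(code, key, data):
    """Build the 32-bit long reply: code, P1, key, P2, data high, P3, data low, P4."""
    bits = to_bits(code, 4)
    bits.append(parity(bits))        # first parity bit covers the reply code
    bits += to_bits(key, 8)
    bits.append(parity(bits))        # second parity bit covers all previous bits
    bits += to_bits(data >> 8, 8)    # first 8 bits of the 16-bit data field
    bits.append(parity(bits))        # third parity bit covers all previous bits
    bits += to_bits(data & 0xFF, 8)  # second 8 bits of the data field
    bits.append(parity(bits))        # fourth parity bit covers the entire reply
    return bits


PARITY_POSITIONS = (4, 13, 22, 31)


def check_long_reply(bits):
    """Return True when every parity bit matches the bits that precede it."""
    if len(bits) != 32:
        return False
    return all(bits[pos] == parity(bits[:pos]) for pos in PARITY_POSITIONS)
```

Placing a parity bit after each field lets a receiver reject a damaged reply as early as the first field in which a single bit error occurs, rather than only at the end of the reply.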

The input state control 56 sets the back channel output mux 74 to select the output of the back channel
FIFO 114 for connection to the back channel 34 output. The IPL 36 shifts out a new back channel 34 bit
whenever a BSHFTI signal 76 is strobed. The BSHFTI signal 76 is connected to the BSTRBO signal 78 from the
switch node 16 connected to the input port.

After receiving the initial idle replies in the back channel 34, the sending controller 18 may begin transmitting
other commands and messages to the receiving controller 18. When transmitting messages to a receiving
controller 18, the sending controller 18 transmits a message header and a checksum through the circuit. The header
command contains information that allows the receiving controller 18 to determine what the sending controller
18 expects from it, e.g., the allocation of space to receive a data segment, the reporting of a status condition,
etc. The checksum command is used by the controllers 18 for error detection and correction. If a circuit is
currently active, the switch nodes 16 pass it on as though it were data. If a circuit is not currently active, the
command decode 52 sets a command error flag in the input status register 66. The receiving controller 18
acknowledges receipt of the header and its checksum by responding with a reply on the back channel 34 consistent
with the instructions in the header.

A header with data transmission is similar to the header-only transmission with the exception that an extra
acknowledgement is included. An ACK (acknowledge) reply is transmitted by the receiving controller 18 after
the receipt of the last header data item or header checksum. Upon receiving the ACK reply from the receiving
controller 18, the sending controller 18 proceeds to send the data. A disconnect success command is not sent
by the sending controller 18 until the ACK reply is received from the receiving controller 18.

Whenever there are no other commands or data to be sent, the sending controller 18 transmits idle com- 
mands on the forward channel 32 to the receiving controller 18, and the receiving controller 18 returns idle re- 
plies on the back channel 34 to the sending controller 18. 

17. DISCONNECTION PROTOCOL 

When positive replies are received and the sending controller 18 has nothing further to do, it transmits
a success command that indicates the circuit is to be disconnected. For example, after a data block has been
sent, the sending controller 18 is required to wait until it has received idle replies from the receiving controller
18 before sending a disconnect (success) command (if the idle replies were already received while the sending
controller 18 was in the process of transmitting the data block, then it does not have to wait).

If any replies on the back channel 34 are negative, a failure command is transmitted by the sending
controller 18, which causes the circuit to be disconnected. The failure command also causes the receiving controller
18 to abort the work associated with the communication session.

Upon receipt of the success or failure command, the input state control 56 of each switch node 16 goes
into a "disconnect pending" state and passes the command to the next stage switch node 16. Upon receiving
a success or failure command, the receiving controller 18 sends a success or failure reply, and stops
transmitting replies on the back channel 34. The switch nodes 16 interpret a success reply as a disconnect reply.
However, if an error occurs, e.g., the receiving controller 18 could not validate a previous data block, it may respond
with a failure reply, which is interpreted as a disconnect (failure) reply and aborts the disconnect process.

Unlike the connect command which created the circuit, the disconnect (success) reply tears down the circuit
from the receiving controller 18 back to the sending controller 18. However, the switch nodes 16 enforce a
protocol requiring the disconnect (success) command to propagate forward before the disconnect (success) reply
is recognized.

When a disconnect (success) reply arrives on the back channel 34 in a switch node 16, the input state
control 56 disables the output port select 58, thus dropping the request to the 8-input arbiter 98 and the connection
between IPL 36 and OPL 38. This causes each selected OPL 38 to transmit soft carriers on the forward channel
32 output and frees the 8-input arbiter 98 to select another IPL 36 for service. Simultaneously, the IPL 36
transmits the disconnect (success) reply, followed by nulls, on the back channel 34 output. The disconnect (success)
reply is thus propagated out to the next stage switch node 16, where the disconnect reply causes that switch
node 16 to do the same thing, and so on until the disconnect (success) reply is propagated back to the sending
controller 18.

When the IPL 36 detects soft carriers on the forward channel 32 input, the input state control 56 transitions
to a "no circuit" state and becomes ready to participate in another circuit. When the sending controller 18
receives a failure or success reply followed by inactivity on the back channel 34, the circuit is completely
disconnected. The sending controller 18 then starts to transmit soft carriers on the forward channel 32 output.

18. MONOCAST LOAD BALANCING 

Figure 23 illustrates the paths traversed through the network 14 by a monocast connect command. Two
load balancing methods are used when routing monocast messages upward through the left half stages of the
unfolded network 14. (Direct routing to the correct switch node 16 using the tag mapping tables 108 occurs for
monocast connect commands in the right half stages of the unfolded network 14.) A first method chooses the
output port that is directly across from the requesting input port, if this port is available; otherwise the next
numerically higher output port (modulo 8) that is available is chosen. A second method looks at the routing tag,
and in conjunction with the tag mapping tables 108, routes the connect command to the correct switch node
16 at the next stage. If the desired port is unavailable, the next numerically higher port (modulo 8) that is available
is chosen. Both load balancing methods are designed to evenly distribute requesting input ports to available
output ports. Thus, message traffic should be evenly distributed in the network 14, especially when the network
14 is not fully configured.
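
Both load balancing methods share the same fallback search, which can be sketched as follows. This is an illustrative model with hypothetical function names. It assumes that the output port "directly across" from an input port is the output port with the same number, and that the tag field passed to the second method is the 3-bit value for the current stage.

```python
def first_available_port(preferred, available, n_ports=8):
    """Return the preferred port if free, else the next higher free port (modulo 8).

    Returns None when no output port is available at all.
    """
    for offset in range(n_ports):
        port = (preferred + offset) % n_ports
        if port in available:
            return port
    return None


def balance_straight_across(input_port, available):
    """First method: prefer the output port directly across from the input port."""
    return first_available_port(input_port, available)


def balance_by_tag(tag_mapping_table, tag_field, available):
    """Second method: prefer the port named by the tag mapping table."""
    return first_available_port(tag_mapping_table[tag_field], available)
```

Because the search wraps modulo 8, a request arriving on input port 3 with only output ports 0 and 1 free is steered to output port 0.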

19. MONOCAST NON-BLOCKING 

When a routing tag specifies "monocast non-blocking", certain steps are taken if the desired path through 
the network 14 is unavailable. If no output port is available in a switch node 16 in the left half of the unfolded 
network 14, or the selected output port is not available in a switch node 16 in the right half of the unfolded net- 
work 14, then the circuit is backed-off. Back-off occurs only on monocast connect commands. 

The network 14 provides the ability to back-off connections in such a way that a completely different connect
command is tried from the sending controller 18 rather than retrying the original connect command in the switch
node 16. This reduces contention in the network 14. By trying different connect commands, overall performance
is improved in terms of number of connections per second. This also minimizes any "hot spots," wherein heavy
traffic in one section of the network 14 impedes the flow in the rest of the network 14 because of a rippling effect.
A back-off protocol is realized by the network 14 sending a disconnect reply back to the sending controller 18
on the back channel 34 when an idle reply is expected after the transmission of a connect command on the
forward channel 32.

20. MONOCAST BLOCKING

When a routing tag specifies "monocast blocking", certain steps are taken if the desired path through the
network 14 is unavailable (back-off is disabled). If no output port is available while load balancing, or if the port
selected by the tag mapping table 108 is unavailable, then the switch node 16 waits (i.e., does nothing) until
a suitable port becomes available. This effectively gives the connection a higher priority than connections with
back-off enabled, since it will get through in the shortest time possible.

21. MONOCAST BLOCKING WITHOUT LOAD BALANCING

When a routing tag specifies "monocast blocking without load balancing", the tag mapping table 108 is used
to select a specified output port in each switch node 16. The purpose of this type of message is to route a
circuit deterministically, and it is typically used in conjunction with a test command or test reply for a selected
switch node 16. It may also be used by other diagnostic routines.

22. MONOCAST PIPELINE

The network 14 also provides a pipeline protocol wherein the sending controller 18 is allowed to send the
data portion of the messages after a predetermined number of clock cycles after it has sent the connect
command, i.e., the sending controller 18 need not wait for the receiving controller 18 to "acknowledge" (in the form
of back channel 34 idle replies) the connect command. Immediately after the sending controller 18 has
transmitted the connect command, it starts to count wait cycles; at a terminal count value, the sending controller 18
begins sending data. The terminal count value is chosen such that if the connect command is not backed off,
data transmitted at the terminal count value should catch up with the connect command just after the connect
command reaches the receiving controller 18.

On the other hand, if the connect command is backed-off before it reaches the receiving controller 18, the
switch node 16 that could not allocate any output port sinks the connect command and any subsequent data,
and sends a disconnect (failure) reply followed by nulls out its back channel 34 to the sending controller 18.
The disconnect (failure) reply disconnects the partially established circuit as it is propagated back through the
intervening switch nodes 16 to the sending controller 18. The sending controller 18 recognizes the disconnect
(failure) reply as indicating that the connection has been backed off. The sending controller 18 then sources soft
carriers at its forward channel 32 output, and proceeds to a different connection, queueing the original
connection for later retry.

If the receiving controller 18 sees the connect command, then the circuit has not been backed-off by the
network 14. At this point, the receiving controller 18 has the option of backing-off the circuit by responding with
a failure reply and discarding data bytes, or it may complete the circuit by responding with idle replies on its
back channel 34. The (idle or failure) reply is propagated back through the network 14 toward the sending
controller 18. A failure reply indicates that the circuit has been backed-off. An idle reply indicates to the sending
controller 18 that the circuit is complete.

23. MONOCAST NON-PIPELINE

A non-pipelined protocol is used when back-off is disabled (back-off is disabled for high-priority messages).
The non-pipelined protocol for monocast connect commands is similar to the pipelined protocol, with one
difference. Instead of counting wait cycles before sending data, the sending controller 18 first waits for the idle
replies from the receiving controller 18 to arrive, then it begins transmitting data (the stream of idle replies
indicates a circuit has been established). If the sending controller 18 were to transmit data before this, then there
is a chance that the data will arrive at a switch node 16 that has not yet forwarded the connect command, which
would result in data loss and a network 14 error. At the end of the data block, the sending controller 18 sends
a disconnect (success) command (in this case, since it already received idle replies, it can send this command
immediately following data). The remainder of the protocol is identical to the pipelined protocol.

24. CONTROLLER SOFTWARE 

Software executing in the controller 18, and more specifically in the SPARC™ microprocessor 56, provides
a number of different functions, including configuration of the network 14 to support the flexible cabling, services
to the PM 12 for the logical use of the network 14, dynamic load balancing of traffic on multiple networks 14,
and automatic fault recovery so that a network 14 may continue operation in a degraded mode, i.e., when a
network 14 is not fully connected because a fault has caused at least one PM 12 to be inaccessible and when
one or both networks 14 have a single fault that is reconfigured around, leaving the network 14 fully connected.

The software is based around a set of tasks as shown in Figure 24, including a device interface 156, a
RxDispatcher 158, a TxDispatcher 152, and a timer 154. The device interface 156 handles communications
with the operating system. The RxDispatcher 158 handles the building and completion of incoming network 14
connections. The TxDispatcher 152 handles the building and completion of outgoing network 14 connections.
The timer 154 handles timeout events, blocking and unblocking of connections, and other time related functions.

Generally, requests from the operating system are processed by the device interface 156 and queued up
in the TxSend queue 164 until they are transmitted to the network 14. Circuits blocked physically by the network
14, i.e., backed off circuits, are put in the TxSend queue 164. Circuits blocked logically, e.g., due to congestion
in the receiving controller 18, are put on the Blocked queue 168 for retransmission after a time-out. Circuits
that are completed are queued on the TxDone queue 166 for completion processing by the TxDispatcher 152.
A circuit completion may do many things, but commonly will return a response to the PM 12.
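
The queue discipline described above can be illustrated with a small model. This is a hedged sketch, not the controller software: the class, method, and outcome names are hypothetical, and moving blocked circuits back to the TxSend queue when the time-out expires is an assumption about how "retransmission after a time-out" is carried out.

```python
from collections import deque


class TxQueues:
    """Toy model of the TxSend, Blocked, and TxDone queues."""

    def __init__(self):
        self.tx_send = deque()  # requests waiting to be transmitted
        self.blocked = deque()  # circuits blocked logically, awaiting a time-out
        self.tx_done = deque()  # completed circuits awaiting completion processing

    def submit(self, circuit):
        """Queue a request coming from the device interface."""
        self.tx_send.append(circuit)

    def next_to_send(self):
        """Take the oldest pending circuit, or None if nothing is queued."""
        return self.tx_send.popleft() if self.tx_send else None

    def record_outcome(self, circuit, outcome):
        """File a circuit according to how its transmission attempt ended."""
        if outcome == "backed_off":   # blocked physically by the network
            self.tx_send.append(circuit)
        elif outcome == "blocked":    # e.g. congestion in the receiving controller
            self.blocked.append(circuit)
        elif outcome == "complete":
            self.tx_done.append(circuit)
        else:
            raise ValueError("unknown outcome: " + repr(outcome))

    def timeout_expired(self):
        """Assumed behaviour: return blocked circuits to TxSend for retransmission."""
        while self.blocked:
            self.tx_send.append(self.blocked.popleft())
```

Requeueing a backed-off circuit behind other pending circuits matches the back-off idea of trying a different connection before retrying the original one.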

Generally, the RxDispatcher 158 receives the header of any incoming circuit. Once a header is received,
the RxDispatcher 158 builds the appropriate channel program to receive the remainder of the circuit. When the
circuit is complete, the RxDispatcher 158 may do many things, but will commonly return a request to the PM
12 as unsolicited input or a response to the PM 12 that matches a previously received request.

25. SUPERCLUSTERS 

Superclusters are partitions of the network 14 comprising logically independent subsets of PMs 12.
Communication within one subset is prevented from interfering with communication in any other subset, yet
communication between superclusters is maintained. This is implemented using supercluster partitioning rules.

In a network 14 of a given size, there are many subnetworks 14 that can operate independently of one an- 
other. Thus, there is a means to support the supercluster concept. The rules for forming superclusters are rel- 
atively flexible and simple. To be precise, however, it is useful to define some notation. 

Assume there are N network I/O ports 20 in a network 14 and that $m = \lceil \log_2 N \rceil$, wherein $\lceil \log_2 N \rceil$ indicates
a ceiling function providing the smallest integer not less than $\log_2 N$. If $n = \lceil \log_8 N \rceil$ and N is a power of 8, then
$m = 3n$. The binary address of a PM 12 (which is its Level or network I/O port 20 address in the network 14)
can then be represented as $Y = \{y_{m-1} \ldots y_1 y_0\}$, $y_i \in \{0,1\}$, $0 \le i < m$. The relationship between Y and X, the octal
representation, is simply $x_j = \{y_{3j+2}\, y_{3j+1}\, y_{3j}\}$, $0 \le j < n$.

The basic rule is that any group of PMs 12 whose binary addresses agree in p high order bits may form a
supercluster of size $2^{m-p}$, and there can be $2^p$ superclusters of that size in the network 14. In the preferred
embodiment, however, a supercluster may not be smaller than the switch node 16 size, e.g., 8. The binary address
of each supercluster is $\{y_{m-1}\, y_{m-2} \ldots y_{m-p}\}$, i.e., the number formed by the p high order bits. In supercluster
number 0, for example, every PM 12 will have an address of the form $\{0_{m-1}\, 0_{m-2} \ldots 0_{m-p}\, y_{m-p-1} \ldots y_1 y_0\}$. Because each
supercluster is logically embedded in a subnetwork, the rules can be applied recursively to subdivide
superclusters into smaller clusters. Thus, any network 14 can contain a multiplicity of superclusters of different sizes.
However, each size is a power of two and none smaller than eight. If these rules are followed, the
communications among the PMs 12 in one supercluster will not overlap and/or deadlock with the communications among
PMs 12 in any other supercluster.
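
The addressing rule lends itself to a short numerical illustration. The sketch below uses hypothetical function names and treats the PM 12 address as a plain integer whose bit 0 is y0; it is an aid to reading the notation, not part of the described apparatus.

```python
def supercluster_size(m, p):
    """Size of a supercluster whose members agree in the p high order bits."""
    return 2 ** (m - p)


def supercluster_address(pm_address, m, p):
    """The number formed by the p high order bits of an m-bit PM address."""
    return pm_address >> (m - p)


def same_supercluster(a, b, m, p):
    """Two PMs share a supercluster when their p high order bits agree."""
    return supercluster_address(a, m, p) == supercluster_address(b, m, p)


def octal_digits(pm_address, n):
    """Octal digits x_0 .. x_(n-1), where x_j is formed from bits 3j+2, 3j+1, 3j."""
    return [(pm_address >> (3 * j)) & 0b111 for j in range(n)]
```

For example, with N = 256 (m = 8) and p = 3 there are 8 superclusters of 32 PMs each, and addresses 37 and 63 both fall in supercluster 1 while address 64 falls in supercluster 2.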

For networks 14 in which N is not a power of two, the PMs 12 are equally allocated to the address blocks
that are a power of two in size with no address block smaller than the switch node size. This is controlled by
where a given PM 12 is plugged into the network 14. For example, if there are 50 PMs 12 and they are to be
split into two superclusters of 25 PMs 12 each, the first group should be plugged into the network 14 where
the decimal addresses range from 0 to 31 and the second group where the decimal addresses range from 32
to 63.

In general, a list is made of the sizes desired and the smallest power of two not less than each size (or
eight, whichever is larger) is computed. To compute the size of the network 14 needed, the powers of two just
computed are summed and the smallest power of two not less than that sum is calculated. Next, the network
14 is divided in half recursively as needed until there is a section which is the size of each power of two that
was calculated for each supercluster. The PMs 12 in each supercluster are assigned to addresses in the
corresponding range. If this is done, each PM 12 in a cluster meets the addressing constraints.

As an example, assume a network 14 is to be configured with five superclusters of sizes 75, 25, 30, 20,
and 10 PMs 12. The corresponding powers of two are 128, 32, 32, 32, and 16. The sum of these is 240, so a
network 14 of size 256 is needed. This is divided into the address ranges 0-127 and 128-255. The group of 75
PMs 12 is assigned anywhere in the range 128-255. The remaining free addresses are split in half again
to yield groups 0-63 and 64-127. These are too big, so they are both split yielding 0-31, 32-63, 64-95, and
96-127. The 25 PM 12 supercluster can be assigned to any of these address groups, e.g., 0-31. Similarly, the group
of 30 PMs 12 is assigned to 32-63 and the group of 20 PMs 12 is assigned to 64-95. The remaining group of
10 could be assigned anywhere in the range 96-127, but it is best to pick one half or the other. Another group
of up to size 16 could be added later, or those boards could be depopulated from the network 14. The last
supercluster is assigned to 96-111.
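
The sizing procedure can be checked with a short calculation. The sketch below is one possible realization under stated assumptions: the function names are hypothetical, and instead of halving the network recursively it packs the blocks largest first from address 0, which also yields blocks aligned on multiples of their own size. The resulting address ranges therefore differ from those chosen in the example above while satisfying the same addressing rules.

```python
def next_power_of_two(n):
    """Smallest power of two not less than n."""
    power = 1
    while power < n:
        power *= 2
    return power


def plan_superclusters(sizes, min_block=8):
    """Return (block sizes, network size, address range per supercluster).

    Blocks are packed largest first so that each block starts on a multiple
    of its own size (an assumed placement strategy, not the only valid one).
    """
    blocks = [max(min_block, next_power_of_two(s)) for s in sizes]
    network_size = next_power_of_two(sum(blocks))
    order = sorted(range(len(sizes)), key=lambda i: -blocks[i])
    ranges = [None] * len(sizes)
    base = 0
    for i in order:
        ranges[i] = (base, base + blocks[i] - 1)
        base += blocks[i]
    return blocks, network_size, ranges


blocks, network_size, ranges = plan_superclusters([75, 25, 30, 20, 10])
```

For the five superclusters of the example this reproduces the block sizes 128, 32, 32, 32, and 16 and the network size of 256.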

Both networks 14 in a system are partitioned in equivalent ways. That is, even though the PMs 12 in a
supercluster do not have to be assigned to the same block of addresses in each network 14, they must be in a
valid partition of the same size in each network 14. For simplicity it is recommended that the address blocks
be made the same in each network 14.

Superclusters can be created in existing systems 10 without recabling the PMs 12 if each PM 12 is plugged
into the same address in each network 14, as long as the address rules are followed. It may be, however, that
one supercluster has more PMs 12 than another due to the way in which the PMs 12 are attached to the network
14. To avoid this problem, the PMs 12 should be evenly distributed among the network I/O ports 20.

26. MULTICAST 

Figure 25 illustrates the paths traversed through the network 14 by a multicast connect command. The tag
mapping table 108 is used to select a single output port for switch nodes 16 in the left half of an unfolded network
14, thereby steering the multicast connect command and routing tag to a particular bounce-back point 30, i.e.,
a particular input port on a particular switch node 16 in the highest numbered stage in the left half of the unfolded
network 14. The multicast port select register 130 is used to select a set of output ports for switch nodes 16 in
the right half of the unfolded network 14.

All PMs 12 within a supercluster use the same bounce-back point 30 for multicasts. A multicast from one 
supercluster to another uses the bounce-back point 30 of the destination supercluster. These rules prevent 
deadlock by permitting only one multicast at a time within a supercluster partition. In addition, circuits in different 
supercluster partitions cannot interfere with each other. 

No more than one multicast bounce-back point 30 is located at the input of any switch node 16 (which limits
the number of superclusters to the number of switch nodes 16 in the highest numbered stage of the network
14). Furthermore, no two multicast commands are routed through the same switch node 16 in the right half of
the unfolded network 14 (which effectively prohibits supercluster address sizes that are smaller than the size of
a switch node 16). These rules simplify the switch node 16 design because the 8-input arbiter 98 in each switch
node 16 does not need to allow for multiple simultaneous multicast commands.

Multicasts are disjoint, i.e., no supercluster is a subset or superset of another supercluster. This rule pre- 
vents deadlock by preventing one multicast from interfering with another multicast. 

A PM 12 can establish a connection to all other PMs 12 by multicasting to a supercluster comprised of all 
PMs 12. In such a case, only one PM 12 can use the network 14 at a time. 

The multicast capability allows the PMs 12 to be easily coordinated, especially with respect to their status.
The alternative would require point-to-point messages to be sent to every PM 12 and the coordination of status
information received in reply.

Multicasts could easily "hog" all available bandwidth, so a mechanism, called a multicast window, is
included that regulates entry of multicast connections to the network 14 in such a way as to maintain high
performance of multicast and point-to-point messages. Because multicasts are regulated, they are given priority
over point-to-point messages by the switch nodes 16. In addition, the switch nodes 16 "remember" that there
was recent multicast traffic and wait for a timed period to see if more multicasts are forthcoming before allowing
point-to-point traffic to resume.

There are two major elements required by the multicast protocol that are not part of the monocast protocol: 

1. The connection is fully handshaked in that an idle reply is received by the sending controller 18 followed
by a count reply. The idle reply is the result of a merge of all idle replies from all the receiving controllers
18 to which the sending controller 18 is connected. The count reply is a merge of all count replies from
each receiving controller 18 transmitting a one count. The result is the quantity of receiving controllers 18
that have received the multicast.

2. An ACK reply from the receiving controller 18 is required after transmission of the header and also after
the data is transmitted. A disconnect success command is not transmitted by the sending controller 18 until
the ACK reply has been received.

To establish a multicast circuit, the sending controller 18 first transmits a connect command and routing
tag which indicates the selected bounce-back point 30. Each switch node 16 directs the connect command to
the selected bounce-back point 30. Once the connect command reaches the selected bounce-back point 30,
the output port or set of output ports to be selected is determined from the multicast port select register 130 in
the switch node 16. In this case, paths are established from a receiving input port to the entire set of selected
output ports. If some of the selected output ports are currently unavailable, the input state control 56 in the IPL
36 waits until all of the selected output ports become available. The connect command and the routing tag are
then copied to each of the selected output ports and subsequently propagated to the next stage of switch nodes
16. This process is repeated stage by stage, until the connect command reaches all of the designated receiving
controllers 18.

Each receiving controller 18 then responds with a count reply followed by idle replies on the back channel
34. When there is more than one receiving controller 18, merge logic 120 in each switch node 16 combines
the incoming replies on the back channel 34. After a reply decode 110 decodes the reply, the output state control
96 signals the merge logic 120 that it is ready to participate in the reply merge process. If all selected back
channels 34 fail to become ready to merge, then the sending controller 18 detects the failure via timing out.
When the merge logic 120 detects that all selected back channels 34 are ready to merge, it enables the clocks
to all participating back channel FIFOs 114 simultaneously.

There are two types of results from the merge logic 120, i.e., low sort and add. The type is specified by the
reply. In low sort mode, the merge logic 120 outputs the reply with the lowest key followed by the accompanying
data word. In add mode, the data words of all replies are added in a bit serial fashion so that the result has the
same number of bits as the operands. The input state control 56 sets the back channel output mux 74 to select
the output of the merge logic 120 to be connected to the back channel 34 output.
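
The two merge modes can be modelled at the word level as follows. This is a sketch only, with hypothetical function names: each reply is represented as a (key, data) pair, the hardware's bit serial operation is replaced by ordinary integer arithmetic, the "same number of bits as the operands" property is interpreted as truncation to the 16-bit data field, and the choice among replies with equal keys is an assumption.

```python
DATA_BITS = 16  # width of the reply data field


def merge_low_sort(replies):
    """Low sort mode: return the (key, data) reply with the lowest key.

    Assumption: when keys tie, the first such reply is returned.
    """
    return min(replies, key=lambda reply: reply[0])


def merge_add(replies):
    """Add mode: sum the data words, keeping only DATA_BITS bits of the result.

    Assumption: any carry out of the top bit is discarded.
    """
    total = 0
    for _key, data in replies:
        total = (total + data) & ((1 << DATA_BITS) - 1)
    return total


# Five receiving controllers each transmit a one count; the merged count is 5.
count = merge_add([(0, 1)] * 5)
```

Because addition is associative, merging stage by stage gives the same count as adding all replies at once, which is what allows each switch node 16 to merge only the replies from its own selected output ports.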

Stage by stage, the switch nodes 16 merge the back channel 34 replies and send the results through their
back channels 34 to the next stage until a single merged count reply is derived and passed back to the sending
controller 18 by the switch node 16 in stage 0 in the right half of the unfolded network 14. The sending controller
18 verifies that it has received idle replies and checks the count reply (the count reply indicates the total number
of receiving controllers 18 that have received the connect command).

If the count is correct, the sending controller 18 begins sending the data portion of the message, followed
by a checksum command and idle commands; the sending controller 18 then waits for a merged ACK reply
from all receiving controllers 18. The transmitted data and command follow the paths that the connect
command created previously (branching out to multiple ports just as the connect command did) until they have
reached all designated receiving controllers 18.

Each receiving controller 18 individually checks its data with the checksum command it has received. If
correct, each then responds with an ACK reply, followed by idle replies. Just as the count reply did, the ACK
replies from all receiving controllers 18 are merged, stage by stage, until a single merged ACK reply, followed
by merged idle replies, is propagated back to the sending controller 18.

The sending controller 18, which had been waiting for this merged ACK reply to indicate that all receiving
controllers 18 have verified their data, then sends a disconnect (success) command followed by idle commands.
As before, this command and subsequent idle commands are propagated forward through the network, to all
of the receiving controllers 18. Each receiving controller 18 then responds with a disconnect (success) reply
to disconnect the circuit.

The disconnection proceeds in a fashion similar to that of the monocast case. However, switch nodes 16
that are connected to more than one output port wait for the disconnect reply from all selected output ports
before proceeding. As each switch node 16 merges the disconnect reply, it disconnects its paths and sends
the merged reply followed by nulls out of its back channel 34 to the switch node 16 in the previous stage. Its
output ports, when disconnected, individually source soft carriers through their respective forward channels 32
toward the receiving controllers 18. This disconnection procedure is repeated, stage by stage, until finally a
single merged disconnect (success) reply followed by nulls reaches the sending controller 18, indicating a
successfully received message and completed circuit disconnection. The sending controller 18 then sources soft
carriers through its forward channel 32 towards the switch node 16 in stage 0.

27. FORWARD CHANNEL COMMANDS 

The following section describes a protocol which defines the different command types that can be
transmitted through the network 14. The commands are transmitted on forward channels 32 through the network to
receiving controllers 18. In the descriptions of each command, the byte count does not include the command
escape byte.

A Soft Carrier command consists of two bytes and is processed both by switch nodes 16 and controllers
18. The second byte is the one's complement of the first byte, which facilitates error detection during idle
periods by ensuring that all data bits and the parity bit toggle. If there is no circuit present when this
command is received by a switch node 16, the command decode 52 causes the input state control 56 to discard the
command by clocking the next item out of the input FIFO 46. If there is a circuit present or pending, the switch
node 16 sets the forward channel 32 loss error bit in the input status register 66.

Whenever a given output port is not part of a circuit, the output state control 96 sets the output mux 102
to select the command generator 106 and the command generator 106 transmits the soft carrier command,
thereby maintaining the soft carrier on the output port.
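The relationship between the two bytes of a Soft Carrier (or Idle) command can be checked as sketched below.
This is an illustration only: the function name and the byte value 0x5A are hypothetical, the actual command
encodings are not given here, and the parity bit is not modelled.

```python
def is_complement_pair(first: int, second: int) -> bool:
    """Return True if `second` is the 8-bit one's complement of `first`."""
    if not (0 <= first <= 0xFF and 0 <= second <= 0xFF):
        return False
    # Every data bit must differ between the two bytes.
    return (first ^ second) == 0xFF


first_byte = 0x5A                      # hypothetical value, for illustration only
second_byte = first_byte ^ 0xFF        # one's complement of the first byte
pair_ok = is_complement_pair(first_byte, second_byte)
pair_bad = is_complement_pair(first_byte, first_byte)
```

Because each data bit is sent once as 0 and once as 1, a data line that is stuck at either level shows up as a
mismatch in this check.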

A Success command consists of one byte and is used by both the switch nodes 16 and network controllers
18. If a circuit is currently active, the switch nodes 16 interpret it as a disconnect and pass it on.

A Failure command consists of one byte and is used by both the switch nodes 16 and controllers 18. If a
circuit is currently active, the switch nodes 16 interpret it as a disconnect and pass it on.

An Idle command consists of two bytes and is processed both by switch nodes 16 and controllers 18. The
second byte is the one's complement of the first byte, which facilitates error detection during idle periods by
ensuring that all data bits and the parity bit toggle. If there is no circuit present when this command is received
by a switch node 16, then the soft carrier loss error bit is set in the input status register 66. If there is a circuit
pending (Connect command received but not yet forwarded), then the command decode 52 causes the input
state control 56 to discard the command by clocking the next item out of the input FIFO 46. If there is a circuit
present, then the switch node 16 passes this command on to the next stage.

Whenever a given output port or its associated input port detects an error while a circuit is present, the
input state control 56 sets the data select mux 60 to select the command generator 64 and the command
generator 64 transmits the Idle command, thereby maintaining the circuit without propagating the error.
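The circuit-state rules given above for the Soft Carrier and Idle commands can be summarized as a small
decision function. The Python sketch below is illustrative only: the names `Circuit` and `handle` and the
returned action strings are hypothetical, and only the decisions themselves are taken from the description.

```python
from enum import Enum


class Circuit(Enum):
    NONE = "none"
    PENDING = "pending"    # Connect command received but not yet forwarded
    PRESENT = "present"


def handle(command: str, circuit: Circuit) -> str:
    """Return the action a switch node takes for a Soft Carrier or Idle command."""
    if command == "soft_carrier":
        if circuit is Circuit.NONE:
            return "discard"
        return "set forward channel loss error bit"
    if command == "idle":
        if circuit is Circuit.NONE:
            return "set soft carrier loss error bit"
        if circuit is Circuit.PENDING:
            return "discard"
        return "pass on to next stage"
    raise ValueError(f"unmodelled command: {command}")


actions = {(c, s): handle(c, s) for c in ("soft_carrier", "idle") for s in Circuit}
```

The two commands are thus mirror images of each other: each is expected in one circuit state and is treated
as a loss indication when it arrives in the opposite state.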

A Test command consists of one byte and is used by both the switch nodes 16 and the controllers 18. It
provides a mechanism for testing command error processing throughout the network 14. Switch nodes 16 that
have been enabled to respond to this command complement the parity bit on the byte or bytes following this
command.

A Chip-Address-Is command consists of one byte and is followed by two bytes that contain the chip address
and the port number of the sending controller 18. It is only issued while a DP 140 session is active. This
command is stored in the command/data latch 48 upon receipt. The local DP 140 is notified and reads the latch 48
via the diagnostic port logic (DPL) 122. It is used to acknowledge receipt of data. This command is nominally
used only by the switch nodes 16. (It is also used by the controllers 18 for testing purposes only.) Upon receipt,
the IPL 36 in a switch node 16 stores the chip address in the tag latch 50. This command is issued by a switch
node 16 upon receipt of an Escape Reply with a Send-Chip-Addr key field. It is used as part of the process to
determine the topology of the network 14.

A Connect command consists of one byte and is followed by the two byte routing tag. This command is
processed by both switch nodes 16 and controllers 18. When this command is received by a switch node 16,
the command decode 52 causes the input state control 56 to store it into the command/data latch 48 and, on
subsequent clock cycles, it stores the following two bytes into the tag latch 50.

A Checksum command consists of one byte and is followed by two bytes. This command is used exclusively
by the controllers 18. If a circuit is currently active, the switch nodes 16 pass it on as though it were data. If a
circuit is not currently active, then the command decode 52 sets a command error flag in the input status register
66.

A Send-Port-Address command consists of one byte and is used by a controller 18 to obtain its location
in the network 14 from the switch node 16 to which it is directly attached. The port address is returned via the
back channel 34 using an Escape Reply with a Port-Addr-Is key field. This command may also be issued by
switch nodes 16, or received by any switch node 16 or controller 18 for diagnostic purposes.

A Resync command consists of one byte and is used by controllers 18 or switch nodes 16 to request that
the recipient transmit null bits on the back channel 34 for as long as this command is continuously received. It
is issued by controllers 18 or switch nodes 16 when their back channel 34 inputs lose framing synchronization
as a result of an error detected on the back channel 34.

A DP-Ack command consists of one byte and is issued by a controller 18, via the switch node 16 to which
it is directly connected, to the local DP 140. It is only issued while a DP 140 session is active. This command
is stored in the command/data latch 48 upon receipt. The local DP 140 is notified and reads the latch 48 via
the Diagnostic Port Logic (DPL) 122. It is used to acknowledge receipt of data.

28. BACK CHANNEL REPLIES 

The following section describes a protocol which defines the different reply types that can be transmitted
through the network 14. The replies are transmitted in back channels 34 through the network to sending
controllers 18.

There are two types of replies encoded in an open back channel 34, i.e., short and long. Both types begin
with a four bit reply code followed by a parity bit. A long reply follows this with an eight bit key field, a second
parity bit covering all previous bits, the first 8 bits of a 16-bit data field, a third parity bit covering all previous
bits, the second 8 bits of the 16-bit data field, and a fourth parity bit that covers the entire reply. Thus, the short
reply is five bits and the long reply is 32 bits.
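The bit layout of the short and long replies can be sketched in Python as follows. Several details are
assumptions made purely for illustration and are not stated in the description: even parity, most significant
bit first within each field, and the example code, key and data values. The helper names are hypothetical.

```python
def parity(bits):
    """Even-parity bit (an assumption): 1 when the number of ones so far is odd."""
    return sum(bits) % 2


def to_bits(value, width):
    """Bits of `value`, most significant first (the ordering is an assumption)."""
    return [(value >> i) & 1 for i in reversed(range(width))]


def short_reply(code):
    """Four bit reply code followed by one parity bit: 5 bits."""
    bits = to_bits(code, 4)
    return bits + [parity(bits)]


def long_reply(code, key, data):
    """Short reply, then key, parity, data byte, parity, data byte, parity: 32 bits."""
    bits = short_reply(code)
    bits += to_bits(key, 8)
    bits.append(parity(bits))          # second parity bit covers all previous bits
    data_bits = to_bits(data, 16)
    bits += data_bits[:8]
    bits.append(parity(bits))          # third parity bit covers all previous bits
    bits += data_bits[8:]
    bits.append(parity(bits))          # fourth parity bit covers the entire reply
    return bits


short = short_reply(0b1010)            # hypothetical reply code
long_ = long_reply(0b1010, 0x3C, 0x1234)
```

The lengths follow directly from the field widths: 4 + 1 = 5 bits for a short reply and
4 + 1 + 8 + 1 + 8 + 1 + 8 + 1 = 32 bits for a long reply.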

A Test reply is short. This reply provides a mechanism for testing reply error processing throughout
the network 14. Switch nodes 16 that have been enabled to respond to this reply complement the parity on the
reply and then forward it.

A Success reply is short. If a circuit is currently disconnect pending, the switch nodes 16 interpret it as a
disconnect reply and pass it on. Receipt of this reply clears a fail bit in the input status register 66.

A Failure reply is short. If a circuit is currently disconnect pending, the switch nodes 16 interpret it as a
disconnect reply and pass it on. Receipt of this reply sets the fail bit in the input status register 66.

An Idle reply is short. Once a circuit is established, this reply is sent whenever there is nothing else to be
sent by a controller 18.

A Pause reply is short. Once a circuit is established, this reply is sent by controllers 18 and is merged by 
switch nodes 16. It is compatible with the idle reply. If both are present, then the result is a pause reply. 

ACK (Acknowledge), NAK (Negative Acknowledge), and NAP (Not A Participant) replies are short. A SAK
(Status Acknowledge/Unsuccessful) reply is long. The NAK response may be of several different types,
indicating a locked, error or overrun condition. The NAP reply typically is used as an indication that the processor
does not have the resources to process the message meaningfully, i.e. not applicable processor.

Once a circuit is established, these replies are sent by controllers 18 and merged by switch nodes 16. These
replies are compatible with Idle, and among themselves, for merge operations. Only the Idle reply is discarded;
the others remain in the back channel FIFO 114 until a merge operation is complete; the winner is the reply
with the highest priority. The SAK reply has the highest priority, followed by the NAK, ACK, and NAP replies.
The NAP and SAK replies may also be merged with the Sense and Sum replies. Table V shows the results of
merge operations, based on the type of reply.
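The priority ordering just described can be sketched as a merge over the replies waiting at a switch node.
This Python fragment is an illustration under stated assumptions: the reply names are plain strings, the
function name `merge_replies` is hypothetical, and an all-Idle input is assumed to merge to Idle.

```python
# Higher number means higher priority, per the ordering SAK > NAK > ACK > NAP.
PRIORITY = {"SAK": 3, "NAK": 2, "ACK": 1, "NAP": 0}


def merge_replies(replies):
    """Merge short-status replies from several back channels into one winner."""
    # Idle replies are discarded; they never win against another reply.
    candidates = [r for r in replies if r != "IDLE"]
    if not candidates:
        return "IDLE"                  # assumption: only Idle replies were present
    return max(candidates, key=PRIORITY.__getitem__)


merged_ok = merge_replies(["ACK", "IDLE", "NAP", "ACK"])
merged_bad = merge_replies(["ACK", "NAK", "ACK"])
merged_sak = merge_replies(["NAK", "SAK", "ACK"])
```

A single NAK among many ACK replies is therefore enough for the sending controller to see a NAK.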

With regard to the SAK reply, it typically denotes the readiness state of a processor module with regard to
a particular task or transaction. Such SAK replies provide, in conjunction with the back channel 34 merge
operations of the network 14, means for providing a global status report for a given task or transaction. Because
the SAK replies may be sorted according to priority, the lowest data content response automatically gains
priority and establishes the least ready status as the global system state. The SAK replies are also used in
conjunction with various other operations, such as system initialization and lockout operations.

If two or more back channel FIFOs 114 contain a SAK reply, then the SAK reply with the lowest value wins
in the merge. The data field of the SAK reply contains the PM 12 number, which is passed through the back
channel 34 with the winning key. If two or more keys are equal, then the lowest PM 12 number is returned.

A Sense reply is long. Once a circuit is established, this reply is sent by controllers 18 and merged by switch
nodes 16. The lowest value wins in a merge. The data field contains the PM 12 number which is passed through
the back channel 34 with the winning key. If two or more keys are equal, then the lowest PM 12 number is
returned. This reply is merged with the NAP and SAK replies.
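The "lowest value wins" rule shared by the SAK and Sense replies can be sketched as a minimum over
(key, PM number) pairs. The Python fragment below is illustrative only: the function name and the sample keys
and PM numbers are hypothetical.

```python
def merge_lowest(replies):
    """Pick the winning (key, pm_number) pair among replies of the same type.

    Tuples compare element by element, so the lowest key wins and equal
    keys are resolved in favour of the lowest PM number.
    """
    return min(replies)


winner = merge_lowest([(5, 12), (3, 40), (3, 7), (9, 1)])
```

Here two replies carry the lowest key 3, and the one from the lower-numbered PM (7) is returned with it.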

A Count reply is long. Once a circuit is established, this reply is sent by controllers 18 and merged by the
switch nodes 16. The key and data fields are bit serially added with other Count replies in the merge logic 120
in a switch node 16. The result is truncated to 20 bits. This operation assumes that least significant bits are
transmitted first. This reply is merged with the NAP and SAK replies. The Sum reply is not merged with any
Count reply.
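Bit-serial addition with the least significant bit first can be sketched as below. The sketch is a software
illustration, not a description of the merge logic 120: the function names are hypothetical, the default width of
20 bits is taken from the truncation mentioned above, and the carry out of the top bit is simply dropped.

```python
def to_lsb_bits(value, width):
    """Bits of `value` in transmission order, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_lsb_bits(bits):
    """Inverse of to_lsb_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits, width=20):
    """Add two LSB-first bit streams one bit per step, truncating to `width` bits."""
    carry = 0
    out = []
    for i in range(width):
        a = a_bits[i] if i < len(a_bits) else 0
        b = b_bits[i] if i < len(b_bits) else 0
        total = a + b + carry
        out.append(total & 1)      # sum bit leaves the adder immediately
        carry = total >> 1         # carry is held for the next bit time
    return out                     # final carry is discarded (truncation)


count_sum = from_lsb_bits(bit_serial_add(to_lsb_bits(5, 20), to_lsb_bits(9, 20)))
wrapped = from_lsb_bits(bit_serial_add(to_lsb_bits(2**20 - 1, 20), to_lsb_bits(1, 20)))
```

Sending the least significant bit first is what allows each sum bit to be forwarded as soon as the
corresponding input bits arrive, since only a single carry bit has to be remembered.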

An Escape reply is nominally issued only by switch nodes 16. The key field is used to specify one of the
following seven options: Available, Not-Available, Send-Chip-Address, Port-Address-Is, DP-Attention,
DP-Data, and DP-Checksum. These replies are all long.

An Available key is used only by switch nodes 16 in the left half of an unfolded network 14 when there is
no circuit present. The data fields are not used. Upon receipt, the OPL 38 in each switch node 16 clears a
Temp-Not-Avail status flag in the output status register 92. This reply and key are issued only as needed, not on a
continuous basis. When a circuit is pending, there is no need to issue this reply and key. Switch nodes 16 that
do not perform load balancing never use this reply and key.

A Not-Available key is used only by switch nodes 16 in the left half of an unfolded network 14 when there
is no circuit present. The data fields are not used. Switch nodes 16 that do not perform load balancing never
use this reply and key. It can be issued after a circuit has been disconnected and as part of the initialization
procedure. It is issued only if all output ports are either permanently unavailable (because nothing is connected)
or temporarily unavailable (due to receipt of this reply from an upstream switch node 16), or if the input port
becomes unavailable due to the detection of an error condition. Upon receipt, the OPL 38 in each switch node
16 sets a Temp-Not-Avail status flag in the output status register 92. This reply and key are issued only when
the status of the switch node 16 changes, not on a continuous basis. Once a circuit is pending, this reply and
key are not issued. At that point the switch node 16 is committed to completing the circuit. Switch nodes 16 in
stage 0 do not issue this reply and key.

A Send-Chip-Addr Key is used by a switch node 16 to ask the switch node 16 at the other end of the back
channel 34 to transmit its chip address via the forward channel 32. The data field is set to 0. The
Chip-Addr-Is command is used to supply the requested chip address. This reply and key are issued only if the hard carrier
is present on the back channel 34. It is used as part of the process to determine the topology of the network
14. It may also be sent or received by the controllers 18 for diagnostic purposes.

A Port-Address-Is Key is issued by a switch node 16 to a controller 18. The data fields contain the value
in the chip address register 121 and port number register 72 in the switch node 16. This reply and key are issued
only as a result of receiving the Send-Port-Addr command. This reply and key may also be received by a switch
node 16 for diagnostic purposes. In such a case, the switch node 16 stores all the reply fields, but otherwise
ignores the reply. A DP 140 may then read the fields by means of the DPL 122.

A DP-Attention Key is used by the diagnostic processor to indicate that a response is available for the
controller 18. It is sent by the local switch node 16 at the next available opportunity when no circuit is present.

A DP-Data Key is used to move data from the DP 140 to the controller 18 while a DP session is active.

A DP-Checksum Key is used to move a checksum from the diagnostic processor to the controller 18 while
a DP session is active.

A Ctl-Error reply is long. It is issued by the controllers 18 to report errors. The key field contains an error
code and the data field contains the PM number. No merge takes place when this reply is received. If a circuit
is present, it is queued up in the back channel FIFO 114 (overwriting whatever was there) and transmitted on
a first-come-first-served basis. If no circuit is present or pending, then this reply is ignored.

A Net-Error reply is long. It is issued by the switch nodes 16 to report errors. The key field contains an error
code and the data field contains the chip address and port number. No merge takes place when this reply is
received. If a circuit is present, it is queued up in the back channel FIFO 114 (overwriting whatever was there)
and transmitted on a first-come-first-served basis. If no circuit is present or pending, then this reply is ignored.

The Ctl-Error and Net-Error replies are only used to try to directly inform the sending controller 18 of a
problem. In any event, other error reporting mechanisms are used to ensure the system logs and responds to the
error condition.

29. NETWORK APPLICATIONS 

Figure 26 illustrates one possible application of the present invention, which comprises a general purpose
multiprocessor computer system 10 capable of massively parallel processing. The system 10 consists of
common platform PMs 12, which are functionally differentiated by software and I/O subsystems, communicating
with each other by means of the redundant networks 14.

The functional differentiation of processing determines the PM 12 type. One PM 12 type, the access module
processor (AMP), manages portions of the relational database. Another PM 12 type, the application processor
(AP), executes an operating system version which provides the processing environment visible to end-users.

An AP consists of one or more CPUs based on the INTEL® 486 CPU, memory 22 and I/O capabilities
including controllers 18. Each AP may include multiple CPUs and includes at least two controllers 18 for
connection to two networks 14. Each AP executes an independent copy of the operating system, preferably UNIX™,
and maintains an independent address space. If a single AP contains multiple CPUs, they may be
tightly-coupled by sharing a single copy of the operating system and a common address space.

An AMP is similar to the AP, but includes greater disk I/O capabilities and has little or no communication
other than disk interfaces and network controllers 18. Each AMP executes an independent copy of the operating
system, which is specifically designed for executing the database software. Each AMP is allocated a portion
of the database for which it is responsible. The database software executing therein maintains separate buffers,
logging, recovery and database management capabilities. AMPs are not overtly aware of the existence of other
AMPs and are normally unaffected by the operations of the other AMPs. AMPs receive requests from the APs
and perform the operations on their portion of the database. Depending on the request, the AMPs may return
simple acknowledgements, results of an aggregation function, or a relation back to the requesting AP. The
AMPs have the capability to work in parallel with other AMPs and to perform parallel operations internally.

The computer system 10 provides a single operational view for on-line transaction processing (OLTP) so
that end users do not know which AP they are physically communicating with or which AP is actually executing
their applications. All APs have the potential to be included in a common pool for these services. OLTP software
routes transactions and responses as necessary, depending on which AP is capable of providing specific
services.

The computer system 10 is based on client-server processing concepts. This allows other host
environments, such as IBM MVS™, DEC VMS™, and other UNIX™ systems, to act as clients to the databases managed
by the computer system 10. Client software running in the host environment communicates with the APs. For
example, in IBM environments, DB2™-style SQL clients can be used to access the database on the computer
system 10. The database executes partially in the APs, but mostly in the AMPs. The AP portion includes
software to process SQL statements into a form required by the AMP. This processing includes parsing SQL into
parallel steps, optimization, AMP message generation, and AMP workload dispatching.

30. CONCLUSION 

This concludes the description of the preferred embodiment of the invention. In summary, the present
invention comprises a multistage interconnect network 14 (MIN) capable of supporting massively parallel
processing, including point-to-point and multicast communications between processor modules (PMs) which are
connected to the input and output ports of the network. The network 14 is built using interconnected 8x8 switch
nodes 16 arranged in 2⌈log_8 N⌉ stages, wherein N is the number of input/output ports in the network and
⌈log_8 N⌉ indicates a ceiling function providing the smallest integer not less than log_8 N. The additional stages
provide additional paths between network input ports and network output ports, thereby enhancing fault tolerance and
lessening contention.
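The stage count 2⌈log_b N⌉ can be evaluated as in the following Python sketch. The function names are
hypothetical, and the ceiling of the logarithm is computed with integer arithmetic to avoid floating point
rounding at exact powers of b.

```python
def ceil_log(n: int, base: int) -> int:
    """Smallest integer k such that base**k >= n."""
    k, power = 0, 1
    while power < n:
        power *= base
        k += 1
    return k


def stage_count(n_ports: int, b: int = 8) -> int:
    """Number of stages, 2 * ceil(log_b N), for N network input/output ports."""
    return 2 * ceil_log(n_ports, b)


stages = {n: stage_count(n) for n in (8, 64, 65, 512)}
```

For 8x8 switch nodes this gives 2 stages for 8 ports, 4 stages for 64 ports, and 6 stages for any port count
from 65 to 512.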

The described embodiment provides numerous advantages. One advantage is reliability. The system is
designed to keep working even when components fail by automatically reconfiguring itself when a fault is
detected.

Another advantage is serviceability. The error reporting method isolates faults to prevent them from
propagating throughout the network.

Still another advantage is manufacturability. For large system configurations, cabling could be very
unmanageable. However, the design of the present invention, along with flexible cable connection rules, makes the
problem tractable for large systems and nonexistent for small systems.

Still another advantage is simple installation. Any processor can be plugged into any available receptacle.
This eliminates a source of errors by dropping the need to make sure every cable is plugged into the correct
connector. All other systems we know of have this cabling constraint.

Still another advantage is high performance per processor. The high connectivity topology, extra stages
of switch nodes, back-off capability, pipelining operation, back channel, and multicast window features combine
to provide a high speed connection capability for each processor regardless of the number of processors in
the system. In other systems, the bandwidth available to each processor tends to decrease as the system size
grows.

31. TABLES

The following tables are Tables I to V referred to hereinabove in the description of the preferred
embodiment.

TABLE I

Incremental Growth at Discontinuities

N          CMA    Type-A   %      Type-B   %      Type-D   %
64→65      1→2    8→10     12.5   0→5      (∞)    2→3
128→129    2→3    16→19    12.5   8→10     12.5   4→5
192→193    3→4    24→28    12.5   12→14    8.0    6→7
256→257    4→5    32→37    12.5   16→19    12.5   8→10     12.5
320→321    5→6    40→46    12.5   20→23    10.0   10→12    10.0
384→385    6→7    48→55    12.5   24→28    12.5   12→14    8.0
448→449    7→8    56→64    12.5   28→32    11.0   14→16    7.0



TABLE II

Number of Type-A Boards Versus Number of PMs if 50% of
the Maximum Bandwidth is Maintained (#PMs < 65)

#PMs      # Type-A Boards
2         1
3-8       2
9-18      3
19-32     4
33-40     5
41-48     6
49-56     7
57-64     8

TABLE III

Number of Type-A Boards in the least populated CMA/A
Versus Number of PMs if 100% of the Maximum Bandwidth is
Maintained (#PMs < 129)

#PMs      # Type-A Boards
2-4       2
5-9       3
10-16     4
17-25     5
26-36     6
37-49     7
50-64     8

TABLE IV

8x8 Chip Address Register Bit Definition

BIT         MEANING            DERIVATION
c10         Right/Left         Hardwired on board
c9          Stage #            Hardwired on board
c8 c7 c6    Cabinet #          DP supplies it
c5 c4 c3    CMA/A #            DP supplies it
c5 c4       CMA/B #            DP supplies it
c2 c1 c0    Type-A board #     Hardwired in backplane
c3 c2 c1    Type-B/C board #   Hardwired in backplane
c0          Type-B/C level     Hardwired on board
TABLE V

Back Channel Merge Logic Results

         Nap     Ack    Nak    Sak    Sense   Count   Sum
Nap      Nap     Ack    Nak    Sak    Sense   Count   Sum
Ack      Ack     Ack    Nak    Sak    Err     Err     Err
Nak      Nak     Nak    Nak    Sak    Err     Err     Err
Sak      Sak     Sak    Sak    Sak    Sak     Sak     Sak
Sense    Sense   Err    Err    Sak    Sense   Err     Err
Count    Count   Err    Err    Sak    Err     Count   Err
Sum      Sum     Err    Err    Sak    Err     Err     Sum
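The entries of Table V can be treated as a lookup table that gives the result type when two replies meet in
a merge. The Python sketch below simply transcribes the table; the names `TYPES`, `TABLE_V` and
`merge_pair` are hypothetical, and the handling of key and data fields within a result type is not modelled.

```python
TYPES = ["Nap", "Ack", "Nak", "Sak", "Sense", "Count", "Sum"]

# One row per reply type; columns follow the order given in TYPES.
TABLE_V = {
    "Nap":   ["Nap", "Ack", "Nak", "Sak", "Sense", "Count", "Sum"],
    "Ack":   ["Ack", "Ack", "Nak", "Sak", "Err", "Err", "Err"],
    "Nak":   ["Nak", "Nak", "Nak", "Sak", "Err", "Err", "Err"],
    "Sak":   ["Sak", "Sak", "Sak", "Sak", "Sak", "Sak", "Sak"],
    "Sense": ["Sense", "Err", "Err", "Sak", "Sense", "Err", "Err"],
    "Count": ["Count", "Err", "Err", "Sak", "Err", "Count", "Err"],
    "Sum":   ["Sum", "Err", "Err", "Sak", "Err", "Err", "Sum"],
}


def merge_pair(a: str, b: str) -> str:
    """Result type when replies of type `a` and `b` are merged, per Table V."""
    return TABLE_V[a][TYPES.index(b)]


symmetric = all(merge_pair(a, b) == merge_pair(b, a) for a in TYPES for b in TYPES)
```

Read this way, the table is symmetric: Nap yields to every other reply, Sak overrides every other reply, and
combinations such as Count with Sum or Ack with Sense produce an error result.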



Claims 

1. A communications system, including a multistage interconnect network (14) comprising a plurality of inter- 
connected active logic switch nodes (16), characterized by diagnostic means for detecting and reporting 
any errors that occur within the network (14), and for isolating the errors without propagating them; and 
reconfiguration means for reconfiguring the network (14) when an error is detected, without interrupting 
communications in the system. 

2. A system according to claim 1, characterized in that said reconfiguration means includes means for
reconfiguring routing tables and input and output enable vectors in the network (14), so that all communications
are routed around a faulty section of the network (14).

3. A system according to claim 1 or 2, characterized by at least one redundant network (14) for providing added
bandwidth and fault tolerance to the system, and control means for switching between the redundant
networks (14) when a failure occurs, so that if one or more of the redundant networks (14) is not available,
another can take over.

4. A method for communicating in a multistage interconnect network (14) comprising a plurality of
interconnected active logic switch nodes (16), characterized by the steps of: detecting and reporting any errors that
occur within the network (14), and isolating the errors without propagating them, and reconfiguring the
network (14) when an error is detected, without interrupting communications in the system.

5. A communications system, comprising a multistage interconnect network (14) comprising a plurality of
interconnected active logic switch nodes (16), each switch node (16) comprising a first plurality of input ports
(36), a second plurality of output ports (38), and means for selectively connecting said input ports (36) to
said output ports (38), characterized in that the multistage interconnect network (14) comprises more than
⌈log_b N⌉ stages of switch nodes (16) wherein b is a total number of switch node input/output ports, N is a
total number of network input/output ports (20), and ⌈log_b N⌉ indicates a ceiling function providing the
smallest integer not less than log_b N, the stages thereby providing a plurality of paths between any network input
port (20) and network output port (20); by diagnostic means for detecting and reporting any errors that
occur within the network (14) and by reconfiguration means for reconfiguring the network (14), without
interrupting the communications in the system, when an error is detected, and for isolating the error without
propagating it.

6. A system according to claim 5, characterized in that the switch nodes (16) further comprise tag mapping
table means, associated with each input port of a switch node (16) for interpreting a routing tag to determine
which output port (38) of the switch node (16) to select in order to route correctly a connect request through
the network.

7. A system according to claim 6, characterized in that the tag mapping table means comprises a memory 
array with a plurality of entries for translating the routing tag to an output port selection, wherein the array 
provides a one-to-one mapping between a logical port selection provided by the routing tag and a physical 
port selection. 

8. A system according to claim 5, wherein the switch nodes (16) further comprise input enable vectors for 
indicating which input ports (36) of the switch node (16) are operational, and output enable vectors for in- 
dicating which output ports (38) of the switch node (16) are operational. 

9. An apparatus for concurrently transferring messages between different ports, comprising: multistage in- 
terconnect network means (14) for interconnecting a plurality of switch nodes (16) for communication there- 
between, each switch node (16) comprising a first plurality of input ports (36), a second plurality of output 
ports (38), and means for selectively connecting said input ports (36) to said output ports (38), character- 
ized by dynamic configuration means for determining how the switch nodes are interconnected and hence 
a topology of the multistage network means (14) whereby messages can be routed correctly between the 
switch nodes. 

10. Apparatus according to claim 9, characterized in that the dynamic configuration means further comprises: 
means for communicating addresses between switch nodes (16), wherein one switch node (16) communi- 
cates its location to another switch node (16) connected thereto; and tag mapping table means, in each 
switch node (16), for storing routing information derived from the addresses of the switch nodes (16), so 
that messages can be routed correctly between the switch nodes (16). 

11. An apparatus according to claim 9 or 10, characterized by initializing means for querying the switch nodes
(16) at start up to determine how they are interconnected and for generating the tag mapping table means
in response thereto.

12. A method for communicating in a network (14) comprising a plurality of switch nodes (16), each switch node
(16) having a first plurality of input ports (36), a second plurality of output ports (38), and path selector means
for selectively connecting the switch node input ports (36) to the switch node output ports (38), character-
ized by the steps of (a) interconnecting the switch nodes (16) in a relatively arbitrary manner to effect a 
multistage interconnect network (14); (b) determining how the switch nodes (16) have been interconnected; 
(c) constructing routing tables for each switch node (16) according to how the switch nodes (16) have been 
interconnected; and (d) transferring messages through the network (14) according to the routing tables, 
wherein each switch node (16) that receives the messages uses a routing table to determine which output 
port (38) should receive the message. 

13. A method according to claim 12, characterized in that said determining step comprises: communicating 
addresses between the switch nodes (16), so that a topology for the network (14) can be determined; and 
storing routing information derived from the addresses of the switch nodes (16), so that messages can be 
routed correctly through the network (14). 

14. A communications system, comprising: network means (14) for providing bidirectional data transmission 
between agents (12) connected thereto, the network means comprising switch nodes (16) connected to- 
gether in a multistage interconnect network, each switch node comprising a first plurality of input ports 
(36), a second plurality of output ports (38), and means for selectively connecting said input ports to said 
output ports, characterized by self-diagnosing means, integrated with the network (14), for detecting and 
reporting any errors that occur within the network (14). 

15. A system according to claim 14, characterized in that the self-diagnosing means comprises diagnostic
processor means (140) for monitoring the state of the network means (14), for performing self-tests on the
components of the network means (14), and for initializing the network means (14), said diagnostic processor
means (140) comprising means for configuring routing tables and input and output enable vectors for the
network means (14) so that fault conditions in the network means (14) can be bypassed.

16. A system for concurrently transferring messages between different ports (20) characterized by: a plurality
of switch nodes (16), each switch node (16) comprising a first plurality of input ports (36), a second plurality
of output ports (38), and means for selectively connecting said input ports (36) to said output ports (38);
means for connecting the switch nodes (16) together in a multistage interconnect network (14), the means
for connecting comprising forward channel (32) and back channel (34) signal paths; multicast means,
operative within the network (14), for transmitting forward channel messages from a source to one or more
destinations; and back channel merge means (120), within each switch node (16), for combining back
channel replies received from the destinations into a single result, wherein the result is transmitted on the back
channel (34) to the source.

17. A system according to claim 16, characterized in that the multicast means comprises means for steering 
a multicast request for a supercluster to a bounce back point (30) within the network means (14), wherein 
all multicast requests to the supercluster use the same bounce back point (30). 

18. A system according to claim 16 or claim 17, characterized in that the multicast means comprises means
for permitting only one multicast message at a time within a supercluster, thereby preventing deadlock
between competing multicast requests.

19. A method for concurrently transferring messages between different ports of a multistage interconnect
network (14), the network comprising a plurality of switch nodes (16), each switch node (16) comprising a first
plurality of input ports (36), a second plurality of output ports (38), and means for selectively connecting 
said input ports (36) to said output ports (38), the switch nodes (16) being connected together via forward 
channel (32) and back channel (34) signal paths connected to every input and output port (36,38), char- 
acterized by the steps of: (a) transmitting forward channel messages from a source to one or more des- 
tinations; and (b) combining back channel replies received from the destinations into a single result, where- 
in the result is transmitted on the back channel (34) to the source. 
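
As an illustration of step (b), the following is a minimal sketch of how replies might be combined stage by stage on the way back to the source. It assumes, purely for illustration, that each reply is an integer code, that the smallest code has the highest priority, and that a silent port is represented by None; the names merge_replies and merge_tree are hypothetical and do not appear in the claims.

```python
from typing import List, Optional, Sequence


def merge_replies(replies: Sequence[Optional[int]]) -> Optional[int]:
    """Combine the back-channel replies seen at one switch node.

    Assumption: smaller integer = higher priority; None = no reply.
    """
    present = [r for r in replies if r is not None]
    return min(present) if present else None


def merge_tree(leaf_replies: List[Optional[int]], fan_in: int) -> Optional[int]:
    """Merge replies level by level, fan_in ports per node, toward the source."""
    level = list(leaf_replies)
    while len(level) > 1:
        # Each group of fan_in replies is reduced to one reply by one node.
        level = [merge_replies(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0] if level else None
```

Each node forwards a single value, so the source receives one result regardless of how many destinations replied.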

20. A communications system, comprising: network means (14) for providing bidirectional data transmission 
between a plurality of agents (12) connected thereto, the network means (14) comprising switch nodes (16) 
connected together in a multistage interconnect network (14), each switch node (16) comprising a first plur- 
ality of input ports (36), a second plurality of output ports (38), and means for selectively connecting said 
input ports (36) to said output ports (38), characterized by means for arranging components within the net- 
work means (14) and interconnections between switch nodes (16), so that the number of agents (12) can 
be increased with only a substantially linear increase in the size of the network means (14). 

21. A system according to claim 20, characterized in that the means for arranging comprises: first means for
constructing the network means (14) using only a first board type (170) when the network means (14)
contains 8 or fewer network I/O ports; second means for constructing the network means (14) using only a
first board type (170) and a second board type (176) when the network means (14) contains between 9 and
64 network I/O ports; third means for constructing the network means using only a first board type (170),
a second board type (176) and a third board type (172) when the network means (14) contains between 65
and 512 network I/O ports; and fourth means for constructing the network means (14) using only a first board
type (170), a second board type (176), and a fourth board type (174) when the network means (14) contains
between 513 and 4096 network I/O ports.
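
The port-count thresholds in this claim can be read as a simple lookup. The sketch below is only an illustration: the function name board_types_for is hypothetical, and the board types are labelled by their reference numerals (170, 176, 172, 174) for convenience.

```python
from typing import List


def board_types_for(network_ports: int) -> List[str]:
    """Return the board types used for a network of the given size."""
    if not 1 <= network_ports <= 4096:
        raise ValueError("claim 21 covers 1 to 4096 network I/O ports")
    if network_ports <= 8:
        return ["170"]
    if network_ports <= 64:
        return ["170", "176"]
    if network_ports <= 512:
        return ["170", "176", "172"]
    return ["170", "176", "174"]
```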

22. A system for communicating in a variety of modes between one or more transmitting agents (12) and one
or more receiving agents (12), comprising: a multistage interconnect network (14) intercoupling all the
agents (12) with transmit and receive lines, the network (14) comprising a plurality of switch nodes (16)
arranged in parallel groupings in a plurality of stages within the network (14), characterized in that the
transmitting agents (12) include means for transmitting message routing packets containing destination data
designating one or more receiving agents (12) and means for transmitting variable length data messages;
and in that the switch nodes (16) each comprise means responsive to the message routing packets for
selecting node-to-node paths to one or more receiving agents (12), wherein the switch nodes (16) also
comprise means for acknowledging receipt of a message routing packet and means for preserving the circuit
connection for subsequent transmission of a variable length message.



23. An active logic network (14), comprising: a plurality of switch nodes (16), each having a plurality of input 
ports (36) and a plurality of output ports (38), wherein within the switch node (16) any input port (36) can 
be connected to any output port (38) and any output (38) can be connected to any input port (36); means, 
within each switch node (16), for accepting a routing tag at an input port (36) identifying a destination in
the network (14); and routing determination means, within each switch node (16), for translating the routing 
tag into an output port selection of the switch node (16), thereby identifying a path to the destination. 

24. A network according to claim 23, characterized in that the switch node (16) further comprises: means for
accepting a connect request designating a desired destination to which a connection is requested; selector 
means (58), connected to the means for accepting, for selecting an output port (38) of the switch node 
(16) according to the desired destination; and arbiter means (98), connected to the port selector means 
(58) and associated with each output port (38), for transmitting the connect request to the selected output 
port (38), the arbiter means (98) choosing between contending connect requests when more than one port 
selector means (58) is trying to access the selected output port (38) at the same time, wherein the connect 
request is transmitted to the selected output port (38) when it has been granted priority over the contending 
connect requests. 

25. A network according to claim 24, characterized in that the arbiter means (98) further comprises: means
for servicing contending point-to-point connect requests on a "round robin" basis; and means for providing 
priority to multicast connect requests over point-to-point connect requests. 
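
The arbitration policy of this claim can be sketched as follows. This is a minimal illustration, not the circuit described in the specification: the class name OutputArbiter is hypothetical, and breaking ties between several simultaneous multicast requests by lowest input port number is an assumption made only to keep the example deterministic.

```python
from typing import Iterable, Optional


class OutputArbiter:
    """Arbiter for one output port of a switch node (illustrative only)."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        # Start so that input 0 is first in the round-robin order.
        self.last_granted = num_inputs - 1

    def grant(self, point_to_point: Iterable[int],
              multicast: Iterable[int] = ()) -> Optional[int]:
        """Return the input port that wins this output port, or None."""
        multicast = set(multicast)
        if multicast:
            # Multicast requests take priority over point-to-point requests.
            return min(multicast)  # assumed tie-break: lowest input port
        pending = set(point_to_point)
        # Round robin: scan starting just after the last granted input.
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last_granted + offset) % self.num_inputs
            if candidate in pending:
                self.last_granted = candidate
                return candidate
        return None
```

With inputs 1 and 3 contending repeatedly, successive calls grant 1, 3, 1, and so on, so neither input is starved.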

26. A system for transmitting messages between agents (12) selectively in (1) an arbitrary interconnection mode
or (2) a multicast mode, characterized by a multistage interconnect network (14) comprising a plurality of 
multiple terminal bidirectional switch nodes arrayed in a plurality of stages, the agents (12) each including 
means for generating addressing messages containing destination data, including alternative descriptors 
for individual and multicast group designations, and each being coupled to the switch nodes (16); and 
wherein the switch nodes (16) further include means responsive to the descriptors in the addressing mes- 
sages for selecting routing paths through the network (14) and means responsive to path selection for es- 
tablishing path commitments linking agents (12) for communication of variable length messages. 

27. A network for communicating between agents (12) connected thereto, characterized by packet switching
means for establishing a communication path between sending and receiving agents (12) in response to 
a connect request; circuit switching means for transferring messages of arbitrary length between the send- 
ing and receiving agents (12) once the communication path has been established; pipelining means for 
transferring messages between sending agents and the receiving agents (12) without waiting for the
communication path to be established; back-off means for cancelling the connect request when the
communication path to the receiving agent (12) is unavailable; and retry means, triggered by the back-off means,
for delaying a retry of the connect request, and for trying a different connect request from the sending agent 
(12), thereby reducing contention in the network (14). 

28. A communication system, characterized by: network means (14) coupled to the agents (12), for providing
bidirectional data transmission between a plurality of agents (12) connected thereto, the network means
(14) comprising switch nodes (16) connected together in a multistage interconnect network, each switch
node (16) comprising a first plurality of input ports (36), a second plurality of output ports (38), and means
for selectively connecting said input ports (36) to said output ports (38); and cabling means for wiring be- 
tween different stages in the network means (14) with one or more copies of a universal wiring pattern. 

29. A system according to claim 28, characterized in that the universal wiring pattern comprises a permutation
of switch node ports (36,38) that swaps the least significant two base b digits of a level number represen- 
tation, where b is a total number of switch node input or output ports (36,38). 
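
The digit swap named in this claim can be written out with integer arithmetic. The sketch below assumes the level number is held as an ordinary non-negative integer; the function name swap_low_digits is hypothetical.

```python
def swap_low_digits(level: int, b: int) -> int:
    """Swap the two least significant base-b digits of a level number."""
    d0 = level % b             # least significant digit
    d1 = (level // b) % b      # next digit
    high = level // (b * b)    # all remaining higher-order digits
    return high * b * b + d0 * b + d1
```

For b = 8, level 123 (octal) maps to 132 (octal). Applying the swap twice returns the original level, so the pattern is a permutation of the port levels.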

30. A system for concurrently transferring messages between different ports (20), comprising a plurality of
switch nodes (16), each switch node comprising a first plurality of input ports (36), a second plurality of 
output ports (38) and means for selectively connecting the input ports (36) to the output ports (38); and 
means for interconnecting the switch nodes (16) in a multistage interconnect network (14) having a first plur- 
ality of network input ports (20) and a second plurality of network output ports (20), characterized by par- 
titioning means for grouping the network ports (20) into logically independent subsets, wherein each subset 
is a supercluster; and multicast means, operative within the network (14) for transmitting a message from 
a network input port (20) to one or more network output ports (20) grouped in a supercluster, wherein mes- 
sages transmitted within any one supercluster are prevented from interfering with messages transmitted 
within any other supercluster.

31. A system according to claim 30, characterized in that the partitioning means further comprises: means for
grouping network ports into a supercluster of size $2^{m-p}$, wherein binary addresses for the grouped network
ports are identical in p highest order bits thereof, thereby providing $2^p$ superclusters of size $2^{m-p}$ in the
system, wherein $m = \lceil \log_2 N \rceil$, N is a total number of network input/output ports, and
$\lceil \log_2 N \rceil$ indicates a ceiling function providing the smallest integer not less than $\log_2 N$; and means
for addressing each supercluster as $(Y_{m-1}, Y_{m-2}, \ldots, Y_{m-p})$, wherein $Y_i \in \{0,1\}$ and $m-1 \ge i \ge m-p$.
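
A minimal sketch of this grouping rule: a port's supercluster is identified by the p highest order bits of its m-bit binary address, which can be obtained with a right shift. The function names are hypothetical, and ports are assumed to be numbered from 0 to N-1.

```python
from typing import List


def supercluster_id(port: int, n_ports: int, p: int) -> int:
    """Return the supercluster address: the p highest-order bits of the port."""
    m = (n_ports - 1).bit_length()  # m = ceil(log2 N) for N >= 1
    if not 0 <= p <= m:
        raise ValueError("p must lie between 0 and m")
    return port >> (m - p)


def supercluster_members(cluster: int, n_ports: int, p: int) -> List[int]:
    """List the network ports that belong to one supercluster."""
    return [port for port in range(n_ports)
            if supercluster_id(port, n_ports, p) == cluster]
```

With N = 16 and p = 2, m is 4, so there are four superclusters of four ports each, and port 13 (binary 1101) falls in supercluster 3 (binary 11).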

32. A method for concurrently transferring messages between different ports (20) of a multistage interconnect 
network (14) having a plurality of interconnected switch nodes (16) characterized by the steps of: grouping 
the network ports (20) into logically independent subsets, wherein each subset is a supercluster; and trans- 
mitting a message from a network input port (20) to one or more network output ports (20) grouped in a 
supercluster, wherein messages transmitted within any one supercluster are prevented from interfering
with messages transmitted within any other supercluster. 

33. A method according to claim 32, characterized in that the grouping step further comprises: grouping
network ports (20) into a supercluster of size $2^{m-p}$, wherein binary addresses for the grouped network ports
(20) are identical in p highest order bits thereof, thereby providing $2^p$ superclusters of size $2^{m-p}$ in the
system, wherein $m = \lceil \log_2 N \rceil$, N is a total number of network input/output ports (20), and
$\lceil \log_b N \rceil$ indicates a ceiling function providing the smallest integer not less than $\log_b N$; and
addressing each supercluster as $(Y_{m-1}, Y_{m-2}, \ldots, Y_{m-p})$, wherein $Y_i \in \{0,1\}$ and $m-1 \ge i \ge m-p$.

34. A communication system, comprising: network means (14) for providing bidirectional data transmission be- 
tween a plurality of agents (12) connected thereto, the network means (14) comprising switch nodes (16) 
connected together in a multistage interconnect network (14), each switch node (16) comprising a first plur- 
ality of input ports (36), a second plurality of output ports (38), and means for selectively connecting said 
input ports (36) to said output ports (38), characterized by master clock means for supplying a master clock
signal to the switch nodes (16) throughout the network means; and controller means, coupled between the 
network means and the agents (12) for communicating messages between the agent (12) and the network 
means (14), the controller means comprising clock extraction means for deriving the master clock signal 
transmitted throughout the network means (14), wherein the master clock signal increments a counter in 
the controller means to provide a synchronous date and time of day to each agent. 

35. A communication system, including a network (14) comprising at least two input ports (20) and two output 
ports (20), and capable of simultaneous communications between a different pair of input and output ports 
(20) destination for the message, and for steering the connect request to the output port (20) destination, 
characterized by back-off means, within the network (14) for cancelling the connect request when the output 
port (20) destination is unavailable; and retry means, triggered by the back-off means, for delaying a retry 
of the connect request, and for trying a different connect request between the input port (20) and a different 
output port (20) destination, thereby reducing contention in the network (14). 

36. A system according to claim 35, characterized by a plurality of switch nodes (16), each comprising a first 
plurality of input ports (36), a second plurality of output ports (38), and means for selectively connecting 
said input ports (36) to said output ports (38); and tag mapping table means, associated with each input 
port (36) of a switch node (16) to select in order to route correctly the connect request through the network 



37. A communication method for a network (14) comprising at least two input ports (20) and two output ports 
(20), and capable of simultaneous communications between a different pair of input and output ports, char- 
acterized by the steps of: accepting a connect request containing a routing tag identifying an output port 
destination for the message, and steering the connect request to the output port destination; cancelling 
the connect request when the output port destination is unavailable; and delaying a retry of the connect 
request, and trying a different connect request from the input port (20), thereby reducing contention in the 
network (14). 
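
The back-off and retry steps can be illustrated with a small scheduling sketch. It assumes each sending port keeps a queue of pending connect requests, each stored as a two-element list [destination, earliest_retry_time]; the function name next_request, the queue layout, and the fixed retry_delay are all hypothetical choices for illustration.

```python
from collections import deque
from typing import Deque, List, Optional, Set


def next_request(pending: Deque[List[int]], busy: Set[int],
                 now: int, retry_delay: int = 4) -> Optional[int]:
    """Pick the next connect request to issue at time `now`.

    A request whose destination is busy is cancelled (backed off), its
    retry is delayed, and a different pending request is tried instead.
    """
    for _ in range(len(pending)):
        dest, earliest = pending[0]
        if earliest <= now and dest not in busy:
            pending.popleft()
            return dest                         # connection can proceed
        if earliest <= now:
            pending[0][1] = now + retry_delay   # back off and delay the retry
        pending.rotate(-1)                      # move on to a different request
    return None
```

For example, with requests for destinations 5 and 9 pending and destination 5 busy, the request for 5 is deferred and the request for 9 is issued instead of blocking behind it.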

38. A computer system, characterized by a plurality of processors comprising application processors (AP) for 
connecting external devices to the system, and access module processors (AMP) for managing access 
to a relational database; network means (14), coupled to the processors (12), for providing bidirectional data 
transmission between the processors (12), the network means (14) comprising switch nodes (16) connected 
together in a multistage interconnect network, each switch node (16) comprising a first plurality of input
ports (36), a second plurality of output ports (38), and means for selectively connecting said input
ports (36) to said output ports (38), the network (14) having more than [log b N] stages of interconnected
switch nodes (16), wherein b is a total number of switch node input/output ports (36,38), N is the number
of network input/output ports (20), and [log b N] indicates a ceiling function providing the smallest integer
not less than log b N.

39. A computer system according to claim 38, characterized in that the AP further comprises one or more mi- 
croprocessors, including memory and at least two connections to the network (14), for executing an inde- 
pendent copy of the operating system and maintaining an independent address space. 

40. A communications system, including a network (14) comprising a plurality of interconnected switch nodes
(16), characterized by identification means for communicating a unique identifier to each agent (12) attached 
to the network (14), so that an agent (12) can be connected to any available port (20) of the network (14) 
and determine its address therein, thereby simplifying installation. 

41. A system according to claim 40, characterized in that the identification means comprises means for
determining the agent's address within the network (14) using a level number associated with the port (20)
to which it is connected, said means for determining comprising means for transmitting a command to the
network (14), wherein a switch node (16) that receives the command replies thereto with a port address. 

42. A communications apparatus, including a plurality of switch nodes (16) arranged and interconnected into 
a multistage interconnect network (14), characterized by error testing means for generating test patterns
during data transmission through the network (14), the error testing means comprising means, within each 
switch node (16), for inverting parity on the data transmission so that a receiving switch node (16) reports 
an error. 

43. Apparatus according to claim 42, characterized in that the error testing means further comprises means
for forwarding a test command to the receiving switch node (16) and means for inverting parity on data 
which follows the test command. 
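
The idea of deliberately inverting parity so that the receiver reports an error can be shown in a few lines. The sketch assumes 8-bit data words protected by a single even-parity bit, which is an illustrative choice and not a detail taken from the claims; all names are hypothetical.

```python
from typing import Iterable, List, Tuple


def even_parity(byte: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") & 1


def transmit(data: Iterable[int], invert_parity: bool = False) -> List[Tuple[int, int]]:
    """Attach a parity bit to each word, inverting it when in test mode."""
    return [(b, even_parity(b) ^ int(invert_parity)) for b in data]


def receive(words: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the indices of the words whose parity check fails."""
    return [i for i, (b, p) in enumerate(words) if even_parity(b) != p]
```

In normal operation receive() reports no failures; after a test command sets invert_parity, every following word fails the check, which confirms that the receiving node's error detection and reporting path works.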

44. A switch node (16) providing for the simultaneous interconnection of multiple messages, characterized by 
means defining a plurality of input and output ports (36,38); means for selectively connecting any input
port (36) to either any one output port (38), or a predetermined number of the output ports (38), or all of 
the output ports (38); and error testing means for generating test patterns to test data transmission on the 
input and output ports (36,38), the error testing means comprising means for inverting parity on the data 
transmission so that a switch node (16) receiving the data transmission reports an error. 

45. A switch node according to claim 44, characterized in that the error testing means further comprises means 
for forwarding a test command to the receiving switch node (16) and means for inverting parity on data 
which follows the test command. 

46. A switch node (16) for use in a network (14) having a plurality of similar nodes (16) interconnected by cabling,
for simultaneous interconnection of multiple messages of different types, including message routing pack- 
ets, characterized by: means defining a plurality of input and output terminals (36,38); means for storing 
reconfigurable mapping data which identifies available interconnection paths using the input and output 
terminals (36,38); and means, coupled to the means for storing, for arbitrarily interconnecting any input
with any one output, a predetermined number of the outputs or all the outputs.

47. A switch node (16) according to claim 46, characterized in that the means for storing reconfigurable map- 
ping data further comprises means for storing input and output enable vectors indicating which terminals 
(36,38) are operational. 

48. A switch node (16) according to claim 46 or claim 47, characterized in that the switch nodes (16) each include
forward channel and back channel signal paths (32,34) coupled to each of the input and output ports
(36,38) in each of the switch nodes (16), wherein the back channel signal paths (34) have a narrower
bandwidth relative to the forward channel signal paths (32) and by means for synchronously combining replies
from the back channels (34) so that replies are sorted as they propagate through the switch node (16),
wherein only the reply having a highest priority is transmitted to the next switch node (16) on the back
channel (34).

49. A communications network comprising: a plurality of switch nodes (16) arranged into a multistage
interconnect network (14) having a plurality of input and output ports (20), each port (20) being coupled to an
agent (12) to effect communication between agents through the network characterized by the network (14) 
having a multiple of [log b N] stages of interconnected switch nodes (16), wherein b is a total number of 
switch node input/output ports (36,38), N is the number of network I/O ports (20), and [log b N] indicates a 
ceiling function providing the smallest integer not less than log b N, thereby providing additional paths be- 
tween any network input port (20) and network output port (20) to enhance fault tolerance and lessen con- 
tention; and the network (14) having a loop-back point (30) indicating where the stages of the network (14) 
are physically folded together so that corresponding switch nodes (16) in similarly numbered stages on either
side of the loop-back point (30) are located adjacent to each other, thereby simplifying packaging and min- 
imizing signal path lengths. 
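
The stage count in this claim follows from the ceiling of log b N. The sketch below computes that ceiling with integer arithmetic, which avoids floating-point rounding at exact powers of b. The default multiple of 2 is an assumption taken from the abstract's 2 [log b N] arrangement; the function names are hypothetical.

```python
def ceil_log(n: int, b: int) -> int:
    """Smallest integer k with b**k >= n, i.e. the ceiling of log_b(n)."""
    if n < 1 or b < 2:
        raise ValueError("need n >= 1 and b >= 2")
    stages, reach = 0, 1
    while reach < n:
        reach *= b      # one more stage multiplies the reachable ports by b
        stages += 1
    return stages


def stage_count(n_ports: int, b: int, multiple: int = 2) -> int:
    """Number of stages when the network uses a multiple of ceil(log_b N)."""
    return multiple * ceil_log(n_ports, b)
```

For 8-port switch nodes, 64 network ports need ceil_log = 2 and therefore 4 stages, while 65 network ports need ceil_log = 3 and 6 stages.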

50. A system for concurrently transferring messages, comprising: a multistage interconnect network (14) com- 
prising a plurality of interconnected active logic switch nodes (16); each switch node (16) comprising a first 
plurality of input ports (36), a second plurality of output ports (38), and means for selectively connecting 
said input ports (36) to said output ports (38), characterized by the multistage interconnect network (14) 
comprising more than [log b N] stages of switch nodes (16), wherein b is a total number of switch node in- 
put/output ports (36,38), N is a total number of network input/output ports (20), and [log b N] indicates a 
ceiling function providing the smallest integer not less than log b N, the multistage interconnect network 
(14) providing a plurality of paths between any network input port (20) and network output port (20) to en- 
hance fault tolerance and lessen contention; and multicast steering means, within each switch node (16), 
for routing multicast requests to a specific input port (36) of a specific switch node (16) within the network 
(14), so that only one multicast request can occur at a time, thereby preventing deadlock between com- 
peting multicast requests. 

51. A system according to claim 50, characterized by means for storing a reply from each network output port 
(20) in a back channel (34); and means for collecting replies from the network output ports (20) and for 
applying the replies to merge means for synchronously combining all of the replies, wherein the replies 
are sorted as they propagate through the merge means, so that only the reply having the highest priority 
is transmitted through the system. 

52. Apparatus for concurrently transferring messages between different ports (20), including a multistage in- 
terconnect network (14) comprising a plurality of interconnected active logic switch nodes (16); each switch 
node (16) comprising a first plurality of input ports (36), a second plurality of output ports (38), and means 
for selectively connecting said input ports to said output ports, characterized by the multistage interconnect 
network (14) comprising more than [log b N] stages of switch nodes (16), wherein b is a total number of switch
node input/output ports (36,38), N is a total number of network input ports and network output ports (20),
and [log b N] indicates a ceiling function providing the smallest integer not less than log b N, the multistage 
interconnect network providing a plurality of paths between any network input port (20) and network output 
port (20) to enhance fault tolerance and lessen contention; and deadlock avoidance means, within each 
switch node (16), for allowing only one routing multicast request at a time, thereby preventing deadlock 
between requests. 

53. A communications system, including a plurality of switch nodes (16) arranged into a multistage interconnect 
network (14) having a plurality of input and output ports (20), each port (20) being coupled to an agent (12) 
to effect communication between agents (12) through the network, characterized by the network (14) having 
more than [log b N] stages of interconnected switch nodes (16), wherein b is a total number of switch node 
input/output ports (36,38), N is the number of network input/output ports (20), and [log b N] indicates a ceil- 
ing function providing the smallest integer not less than log b N; and the network (14) having a plurality of 
turnaround points (30) at the highest stage of switch nodes (16), the turnaround points (30) logically dif- 
ferentiating between switch nodes (16) that load balance messages through the network (14) from the switch 
nodes that direct messages to receiving agents (12); means for depopulating switch nodes (16) from the
highest stage to reduce the number of turnaround points (30) in the network (14), as long as at least one 
path exists between every network input port (20) and every network output port (20); and the input and 
output ports (36,38) of the switch nodes (16) in stages adjacent the highest stage sensing when the switch 
nodes (16) in the highest stage are removed and disabling the input and output ports (20) in response there- 
to, thereby lowering the bandwidth of the network (14) and lowering the cost of the network without a loss 
of functionality.

54. A communications system, including a plurality of switch nodes (16), each switch node (16) comprising a
first plurality of input ports (36), a second plurality of output ports (38), and means for selectively connecting 
said input ports (36) to said output ports (38); and means for interconnecting the switch nodes (16) together 
in a multistage interconnect network (14) to simplify the cabling therebetween, wherein a pattern of
interconnections between different stages of switch nodes (16) is specified by permuting the digits of the level
number representing a port of a switch node.

55. A system according to claim 54, characterized in that the means for interconnecting comprises means for
connecting switch node ports identified by $(S : x_{n-1} \ldots x_1 x_0)$ to switch node ports (36,38) identified by
$(S+1 : \mathrm{PERMUTE}\{x_{n-1} \ldots x_1 x_0\})$, wherein S indicates a switch node in a specific stage, n refers to a total
number of stages in the network (14), and x is a level number of a switch node port (36,38) represented as
$(x_{n-1} \ldots x_1 x_0)_b$ in a base b corresponding to a size of the switch nodes (16), wherein $0 \le x_i < b$ and $0 \le i < n$.

56. A communications apparatus, comprising: a multistage interconnect network (14) comprising a plurality of
interconnected active logic switch nodes (16), each switch node (16) comprising a first plurality of input ports
(36), a second plurality of output ports (38), and means for selectively connecting said input ports (36) to
said output ports (38), characterized by the multistage interconnect network (14) comprising more than [log b N]
stages of switch nodes (16), wherein b is a total number of switch node input/output ports (36,38), N is
a total number of network input/output ports (20), and [log b N] indicates a ceiling function providing the
smallest integer not less than log b N, thereby providing a plurality of paths between any network input port
(20) and network output port (20) to enhance fault tolerance and lessen contention; and load balancing
means, in each switch node (16), for distributing messages among the plurality of output ports (20) so that
messages are evenly distributed throughout the network (14).



57. Apparatus according to claim 56, characterized in that said load balancing means comprises: means, with- 
in each switch node (16), for choosing a switch node output port (38) similarly numbered as a requesting 
switch node input port (36) when the similarly numbered switch node output port is available; and means, 
within each switch node (16), for choosing a next available switch node output port (38) when the similarly 
numbered switch node output port (38) is not available. 
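
The port selection rule of this claim can be sketched as a short scan. The claim does not say in which order the "next available" port is searched; scanning upward from the similarly numbered port and wrapping around is an assumption made here, and the name choose_output is hypothetical.

```python
from typing import Optional, Sequence


def choose_output(input_port: int, available: Sequence[bool]) -> Optional[int]:
    """Pick an output port for a request arriving on input_port.

    Prefer the similarly numbered output port; otherwise take the next
    available one (assumed order: upward with wrap-around).
    """
    n = len(available)
    for offset in range(n):
        candidate = (input_port + offset) % n
        if available[candidate]:
            return candidate
    return None  # every output port is in use
```

When all outputs are free, each input maps to its own output number, which spreads traffic evenly across the node's output ports.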

58. A communication system, comprising: a plurality of switch nodes (16), each switch node (16) comprising a 
first plurality of input ports (36), a second plurality of output ports (38), and means for selectively connecting 
said input ports (36) to said output ports (38), characterized by means for connecting the switch nodes 
(16) together in a multistage interconnect network (14), the means for connecting comprising forward chan- 
nel (32) and back channel (34) signal paths coupled to each of the input (36) and output (38) ports in the 
switch nodes (16), wherein the back channel signal paths (34) have a narrower bandwidth relative to the 
forward channel signal paths (32). 

59. A system according to claim 58, characterized by multicast means, operative within the network (14), for 
transmitting forward channel messages from a source to one or more destinations; and back channel 
merge means (120), within each switch node, for combining back channel replies received from the des- 
tinations into a single result, wherein the result is transmitted on the back channel (34) to the source. 



60. A method of communicating among a plurality of switch nodes (16) connected together in a multistage
interconnect network (14), each switch node comprising a first plurality of input ports (36), a second plurality
of output ports (38), and means for selectively connecting said input ports (36) to said output ports (38),
characterized by the steps of:
transmitting messages between switch nodes (16) using a forward channel (32) coupled to each of the input
and output ports in the switch nodes (16); and transmitting between switch nodes (16) using a back channel
(34) coupled to each of the input and output ports (36,38) in the switch nodes (16), wherein the back
channels (34) have a narrower bandwidth relative to the forward channels (32) to simplify packaging.

61. A method according to claim 60, characterized by the steps of transmitting forward channel messages from 
a source to one or more destinations; and combining back channel replies received from the destinations 
into a single result, wherein the result is transmitted on the back channel (34) to the source. 

62. A communications apparatus, including multistage interconnect network means (14) for interconnecting a 
plurality of switch nodes (16) for communication therebetween; dynamic configuration means for determin- 
ing how the switch nodes (16) are interconnected by the multistage interconnect network means (14); and 
means for remapping connections between the switch nodes so that all connections in a backplane
connecting the switch nodes are horizontal.

63. A multiprocessor system, including a plurality of processors (12) and a network (14) for interconnecting the 
processors (12), characterized in that the network (14) comprises a plurality of bounce-back points (30) for 
directing the transmission of messages through the network (14), and means for communicating between 
processors (12) through the network (14) via the bounce-back points (30). 

64. A multiprocessor according to claim 63, characterized in that said network (14) comprises: a plurality of 
switch nodes (16) arranged into a multistage interconnect network (14) having a plurality of input and output 
ports (20), each port (20) being coupled to an agent (12) to effect communication between agents (12) 
through the network (14); the multistage interconnect network (14) comprising more than [log b N] stages of
switch nodes (16), wherein b is a total number of switch node input/output ports (36,38), N is a total number
of network input/output ports (20), and [log b N] indicates a ceiling function providing the smallest integer
not less than log b N, the multistage interconnect network (14) providing a plurality of paths between any
network input port (20) and network output port (20) to enhance fault tolerance and lessen contention; and 
the network (14) having a plurality of bounce-back points (30) at the highest stage of switch nodes (16), the 
bounce-back points (30) logically differentiating between switch nodes (16) that load balance messages 
through the network (14) from switch nodes (16) that direct messages to receiving agents (12).

65. A multiprocessor system, including a plurality of processors (12) and a network for connecting the
processors (12), characterized in that said network (14) has a plurality of bounce-back points (30) for directing the
transmission of messages through the network (14); and means for partitioning the processors (12) into one 
or more superclusters comprising logically independent subsets of processors (12); and by means for
multicast communicating within superclusters by steering a multicast message to a particular bounce-back
point (30) in the network for subsequent routing to the processors (12) in the supercluster, thereby
preventing deadlock in the network (14) by permitting only one multicast message through the particular
bounce-back point (30) at a time and preventing multicast messages to different superclusters from
interfering with each other.